Introduction
I’ve had numerous clients contact me about memory and CPU issues on their servers. On investigation, the cause has been bots scraping The Events Calendar pages and following every link they find within them.
Normally this wouldn’t be a problem on a site with caching and regular pages: the pages would be served from cache without breaking a sweat, as PHP is never involved.
Unfortunately, the links being scraped on The Events Calendar contain query strings, which aren’t cached by default. On top of that, the bots work through the pagination and any other links they can find on the page, including the iCal export URLs.
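To give you an idea of what the bots are actually requesting, here are some hypothetical examples of the kind of uncacheable URLs The Events Calendar generates (the exact paths and parameters will vary depending on your site and views):

/events/?eventDisplay=past
/events/?eventDisplay=list
/events/?ical=1
/event/sample-event/?ical=1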
Bot Blocking List
This is the master list of common bot user-agents.
ahrefsbot amazonbot baiduspider barkrowler bingbot blexbot bytespider claudebot crawler4j curl dataforseo datanyze dotbot duckduckbot exabot facebookexternalhit googlebot gptbot liebaofast linkedinbot mb2345browser meta-externalagent micromessenger mj12bot petalbot pinterest qwantbot scrapy semrush semrushbot seznambot slurp sogou twitterbot wget yandexbot zh_cn
Facebook Link Previews
Be aware that if you want Facebook link previews to work, you must remove the following user agents from the blocking list:
facebookexternalhit meta-externalagent
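A quick way to sanity-check this at the server level (the Nginx and .htaccess rules below) is to request an events page while spoofing Facebook’s crawler; once those two user agents are removed you should get a 200 rather than a 403. Note this only tests user agent matching, not the Cloudflare rule, since cf.client.bot verifies bots by more than the user agent string. domain.com is a placeholder here:

curl -s -o /dev/null -w "%{http_code}\n" -A "facebookexternalhit/1.1" "https://domain.com/events/"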
Blocking Scraping Bots accessing Events Calendar Pages
Block with Cloudflare
The solution is pretty simple: we just need to block known bots from accessing the Events Calendar URLs. Here’s a Cloudflare rule that does just that.
(http.request.uri.query contains "ical" and cf.client.bot) or (http.request.uri.query contains "eventDisplay" and cf.client.bot) or (http.request.uri.path contains "/events" and cf.client.bot)
The above will block requests from known bots whose query string or path contains any of the following:
- ical
- eventDisplay
- /events
This covers most of the requests the bots are making against the calendar.
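If you prefer a shorter rule, factoring out the cf.client.bot check should give a logically equivalent expression:

cf.client.bot and (http.request.uri.query contains "ical" or http.request.uri.query contains "eventDisplay" or http.request.uri.path contains "/events")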
Blocking Scraping Bots with Nginx
If you don’t use Cloudflare for your site, you can block the requests with Nginx. Here’s an example; you will need to adapt it to your setup.
Step 1 – Identify the Bots via Maps in the http Section
Add this to the http section of your Nginx config.
GridPane: create this as a new file at /etc/nginx/conf.d/block-scraping-events.conf
# See /etc/nginx/extra.d/block-scraping-root-context.conf or /var/www/domain.com/nginx/block-scraping-events-root-context.conf
# These directives belong inside the http context (files in /etc/nginx/conf.d/ are already included there, so no http {} wrapper is needed).

variables_hash_max_size 2048;

# Define bot user agents
map $http_user_agent $is_scraping_bot {
    default 0;
    ~*ahrefsbot 1;
    ~*amazonbot 1;
    ~*baiduspider 1;
    ~*barkrowler 1;
    ~*bingbot 1;
    ~*blexbot 1;
    ~*bytespider 1;
    ~*claudebot 1;
    ~*crawler4j 1;
    ~*curl 1;
    ~*dataforseo 1;
    ~*datanyze 1;
    ~*dotbot 1;
    ~*duckduckbot 1;
    ~*exabot 1;
    ~*facebookexternalhit 1;
    ~*googlebot 1;
    ~*gptbot 1;
    ~*liebaofast 1;
    ~*linkedinbot 1;
    ~*mb2345browser 1;
    ~*meta-externalagent 1;
    ~*micromessenger 1;
    ~*mj12bot 1;
    ~*petalbot 1;
    ~*pinterest 1;
    ~*qwantbot 1;
    ~*scrapy 1;
    ~*semrush 1;
    ~*semrushbot 1;
    ~*seznambot 1;
    ~*slurp 1;
    ~*sogou 1;
    ~*twitterbot 1;
    ~*wget 1;
    ~*yandexbot 1;
    ~*zh_cn 1;
}

# Map to handle request URIs
map $request_uri $block_scraping_uri {
    default 0;
    "~*ical" 1;
    "~*eventDisplay" 1;
    "~*\/events" 1;
}

# Combine both conditions into one variable
map "$is_scraping_bot$block_scraping_uri" $block_scraping_request {
    default 0;
    11 1; # Both $is_scraping_bot and $block_scraping_uri are true
}
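Before going any further, check that Nginx accepts the new maps. These commands assume a typical systemd-based setup; nginx -T dumps the full parsed config so you can confirm the maps were picked up:

sudo nginx -t
sudo nginx -T | grep -A3 is_scraping_bot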
Step 2 – Apply blocking to your site
Add the following to your location / block.
GridPane: create a file called /etc/nginx/extra.d/block-scraping-root-context.conf, which will affect all sites. For a specific site, create a file called /var/www/domain.com/nginx/block-scraping-events-root-context.conf instead.
# See /etc/nginx/conf.d/block-scraping-events.conf
if ($block_scraping_request) {
    return 403; # Forbidden
}
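With both pieces in place, reload Nginx and confirm the block is working. This is a sketch using domain.com as a placeholder; the first curl should return 403 and the second should return 200. Note that curl’s default user agent is itself in the bot list, which is why the second request spoofs a browser:

sudo nginx -t && sudo systemctl reload nginx
curl -s -o /dev/null -w "%{http_code}\n" -A "AhrefsBot" "https://domain.com/events/?eventDisplay=list"
curl -s -o /dev/null -w "%{http_code}\n" -A "Mozilla/5.0" "https://domain.com/events/"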
Blocking Scraping Bots with LiteSpeed/OpenLiteSpeed
Here are the appropriate .htaccess rules for blocking bots on LiteSpeed/OpenLiteSpeed.
# Enable the rewrite engine (harmless if it already appears elsewhere in your .htaccess)
RewriteEngine On

# Define partial matches for known bots
RewriteCond %{HTTP_USER_AGENT} ahrefsbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} amazonbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} barkrowler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} bingbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} blexbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} bytespider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} claudebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} crawler4j [NC,OR]
RewriteCond %{HTTP_USER_AGENT} curl [NC,OR]
RewriteCond %{HTTP_USER_AGENT} dataforseo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} datanyze [NC,OR]
RewriteCond %{HTTP_USER_AGENT} dotbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} duckduckbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} exabot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} facebookexternalhit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} googlebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} gptbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} liebaofast [NC,OR]
RewriteCond %{HTTP_USER_AGENT} linkedinbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} mb2345browser [NC,OR]
RewriteCond %{HTTP_USER_AGENT} meta-externalagent [NC,OR]
RewriteCond %{HTTP_USER_AGENT} micromessenger [NC,OR]
RewriteCond %{HTTP_USER_AGENT} mj12bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} petalbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} pinterest [NC,OR]
RewriteCond %{HTTP_USER_AGENT} qwantbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} scrapy [NC,OR]
RewriteCond %{HTTP_USER_AGENT} semrush [NC,OR]
RewriteCond %{HTTP_USER_AGENT} semrushbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} seznambot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} slurp [NC,OR]
RewriteCond %{HTTP_USER_AGENT} sogou [NC,OR]
RewriteCond %{HTTP_USER_AGENT} twitterbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} wget [NC,OR]
RewriteCond %{HTTP_USER_AGENT} yandexbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} zh_cn [NC]
# Block URLs with specific query strings or paths for these bots
RewriteCond %{QUERY_STRING} (ical|eventDisplay) [NC,OR]
RewriteCond %{REQUEST_URI} ^/events/ [NC]
RewriteRule ^ - [F,L]
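As with Nginx, you can spot-check the rules by spoofing a blocked user agent against an events URL (domain.com is a placeholder; expect a 403):

curl -s -o /dev/null -w "%{http_code}\n" -A "AhrefsBot" "https://domain.com/events/?eventDisplay=list"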
Confirming Blocks
You can check your access log to see if you’ve missed any user agents:
tail -f /var/www/domain.com/logs/domain.com.access.log | grep -v 403 | grep event
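The inverse is also useful: watching for event requests that are being served a 403 confirms the rules are firing. The exact log format, and therefore the grep pattern, may differ on your host:

tail -f /var/www/domain.com/logs/domain.com.access.log | grep " 403 " | grep event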
User Agents
Collecting User Agents
If you’re on GridPane using the OpenLiteSpeed logs, you can use the following command to get a list of unique user agents accessing eventDisplay:
cat /var/www/domain.com/logs/domain.com.access.log | grep "eventDisplay" | awk -F'"' '{print $7}' | sort | uniq
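A small variation counts the requests per user agent, which makes it easier to see which bots are hitting the calendar hardest (this assumes the same log format, with the user agent in the seventh quoted field):

cat /var/www/domain.com/logs/domain.com.access.log | grep "eventDisplay" | awk -F'"' '{print $7}' | sort | uniq -c | sort -rn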
Valid User Agents
I’ve listed some common bot user agents above; however, there are also some user agents that belong to real users and shouldn’t be blocked.
- iOS/17.5.1 (21F90) dataaccessd/1.0 = iOS Calendar App
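If you want to gauge how much of your iCal traffic comes from real subscribers before tightening things further, a simple count of that user agent in the access log (same GridPane log path as above) is enough:

grep -c "dataaccessd" /var/www/domain.com/logs/domain.com.access.log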
Changelog
- 01-02-2024 – Added section about Facebook Link Previews
- 01-01-2025 – Updated list of user agents and updated Nginx and .htaccess configs.
- 08-12-2024 – Updated Nginx to be correct syntax, also added more user agents and GridPane files.
- 08-18-2024 – Updated rules to include PetalBot