Limiting bot access to our website by time of day

Lately, most of the client websites I work with run on a custom stack on their own server, whether a VPS with 2 GB of RAM and 1 processor or a dedicated machine with 128 GB of RAM and 32 processors.

Whatever the size of the server, I manage them with GridPane, which today is without a doubt the best panel for setting up a truly professional WordPress stack. Its price starts at $200 per month, to which we must add the cost of each server we use, but if you are serious about this, it is really worth it.

Of course, it is not a service for the vast majority of installations, because it also has to be tuned for each client through intensive use of the console and configuration files rather than a graphical interface, so if the console is not your thing…

The case at hand: a client whose server I administer, a Vultr High Frequency instance with 4 GB of RAM, runs a WooCommerce store with about 14,500 products. It was delivering excellent load times (page and object caching with Redis), but most of the time the CPU was at 150%, and this had been going on for more than a month.

This tells me something about how well Vultr performs, at least on High Frequency Compute servers: other VPSs would already have thrown errors, stopped responding, returned 502 status codes and so on, yet this server kept running without flinching.

After some digging through the logs, everything looked more or less correct. There were many plugins to improve, but that was a matter for a second phase. What I did find was very heavy traffic with the following user agent: “Mozilla/5.0 (compatible; Pinterestbot/1.0; +http://www.pinterest.com/bot.html)”.
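This kind of log search can be sketched in a few lines of Python. It is only an illustration, not the exact procedure I followed: the function names are made up for the example, and it assumes nginx's default "combined" log format, where the user agent is the last quoted field on each line. A custom log format would need a different pattern.

```python
import re
from collections import Counter

# Assumption: nginx "combined" log format, in which the user agent is the
# last double-quoted field on the line.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines):
    """Return a Counter mapping each user agent string to its request count."""
    counts = Counter()
    for line in lines:
        match = USER_AGENT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def top_user_agents(path, limit=10):
    """Read an access log and return the `limit` busiest user agents."""
    with open(path, encoding="utf-8", errors="replace") as log_file:
        return count_user_agents(log_file).most_common(limit)
```

Calling top_user_agents() with the path of the site's access log returns a list of (user agent, request count) pairs, busiest first, which makes a single noisy bot easy to spot.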

After checking the IPs of the requests and their ASN (AS14618 AMAZON-AES), I confirmed that it was the authentic Pinterest bot.

DNS is managed from Cloudflare, where I also run a series of custom firewall rules to block certain requests (I will write an article on this topic shortly). So the first step is to block the requests from Cloudflare, which is very simple:

Go to Firewall -> Tools -> User Agent Blocking and create a blocking rule. As the name I put “Pinterest Bot”, as the action I select challenge (instead of block), and as the user agent I enter “Mozilla/5.0 (compatible; Pinterestbot/1.0; +http://www.pinterest.com/bot.html)”.

I activate the rule and monitor the live server logs with:

# tail -f -n30 /var/log/nginx/domain.com.access.log

Right away I see the Pinterest bot requests start to disappear: they are being stopped by Cloudflare's firewall. Strictly speaking, Cloudflare sends the challenge, and if the client does not pass it, the request is blocked.

The CPU returns to values around 30%, but there is a problem. The client uses Pinterest for commerce, so its bot has to be allowed to stop by from time to time to scan the products and their photos.

Pinterest does not let us choose when its bot comes or how often it crawls. It does, however, respect the Crawl-delay directive, with values up to a maximum of 1 (https://help.pinterest.com/en/business/article/pinterest-crawler), so the first action is to modify the robots.txt file and add the following:

User-agent: Pinterestbot
Crawl-delay: 1
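As a quick sanity check, the two lines above can be parsed with Python's standard urllib.robotparser module to confirm that the directive is read as intended. This is just a sketch: crawl_delay_for is a name invented for the example, and it only shows how Python's parser interprets the file, not how Pinterest's crawler does.

```python
from urllib.robotparser import RobotFileParser

# The same two lines added to robots.txt in the article.
ROBOTS_TXT = """\
User-agent: Pinterestbot
Crawl-delay: 1
"""


def crawl_delay_for(robots_txt, user_agent):
    """Return the Crawl-delay that applies to `user_agent`, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


if __name__ == "__main__":
    print(crawl_delay_for(ROBOTS_TXT, "Pinterestbot"))
```

Note that the parser matches on the bot token ("Pinterestbot"), not on the full user agent string seen in the access log.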

Ideally, I would like to say, “Pinterest, you can scan my website between 10pm and 8am; outside those hours, don't come.” After turning it over for a while, I came up with an idea:

I have a rule in Cloudflare that blocks Pinterest requests and that I can turn on and off, right? And Cloudflare has an API, right?

Here we go. After looking at the documentation a bit and creating an API key for the client domain in question, I did a few tests with Postman and quickly got the rule turned on and off using the API and the newly created key.

Almost there. We could make the call from the server with cron, but there is a service I already told you about in another article that I think is wonderful, and it is free as well: cron-job.org.

I log in to cron-job.org and check that I can make PUT requests, send custom headers, and include JSON in the request body… and that is all I need. I schedule two cron jobs in the service: one at 8am every day that activates the rule, so Pinterest requests get the challenge, and another at 10pm that deactivates the rule, so all requests from Pinterest are allowed.

Screenshot: the cron jobs that enable and disable the Cloudflare rule

Both cron jobs are the same; the only difference is the paused parameter, which is true to pause the rule and false to keep it active.
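The relationship between the time of day and the paused value can be written down as a tiny function. This is only a sketch of the schedule described above (rule paused from 10pm to 8am, active the rest of the day); the function and constant names are hypothetical, and in my setup the decision is simply encoded in the two cron schedules.

```python
from datetime import time

# Crawl window from the article: Pinterest may crawl from 10pm to 8am.
CRAWL_WINDOW_START = time(22, 0)  # 10pm: the rule is paused, Pinterest allowed
CRAWL_WINDOW_END = time(8, 0)     # 8am: the rule is active again


def rule_should_be_paused(now):
    """Return the value of `paused` that applies at the local time `now`.

    The window spans midnight, so it is "after the start OR before the end".
    """
    return now >= CRAWL_WINDOW_START or now < CRAWL_WINDOW_END


if __name__ == "__main__":
    for hour in (3, 8, 12, 22, 23):
        print(hour, rule_should_be_paused(time(hour, 0)))
```

A single job that runs every hour and sends whatever this function returns would be an alternative to two fixed jobs, at the cost of more API calls.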

The part of the API we are interested in here is Update UserAgent Rule, so in the new cron job we first enter the URL https://api.cloudflare.com/client/v4/zones/IDENTIFICADOR-DE-ZONA/firewall/ua_rules/ID-DE-LA-REGLA, where IDENTIFICADOR-DE-ZONA stands for the zone identifier and ID-DE-LA-REGLA for the rule ID.

For the zone identifier, if we expand the API section under our rule in Cloudflare, it shows the URL containing that identifier. If we then make a GET request to https://api.cloudflare.com/client/v4/zones/IDENTIFICADOR-DE-ZONA/firewall/ua_rules with the necessary authentication headers, we get the ID of the rule.

Next, we schedule the cron job for the desired time and check the options we want, such as saving the responses, being notified on failure, etc., and then go to Advanced.
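For reference, the GET request that lists the rules could look roughly like this in Python. Treat it as a sketch under assumptions: the helper names are invented, authentication is assumed to be a bearer API token, and the response is assumed to follow Cloudflare's usual envelope, with the rules in a "result" list and the user agent under "configuration" -> "value". The request is only built here; passing it to urllib.request.urlopen would actually send it.

```python
import json
from urllib.request import Request

API_BASE = "https://api.cloudflare.com/client/v4"


def build_list_rules_request(zone_id, api_token):
    """Build (without sending) the GET request that lists the zone's UA rules."""
    return Request(
        f"{API_BASE}/zones/{zone_id}/firewall/ua_rules",
        headers={"Authorization": f"Bearer {api_token}"},
        method="GET",
    )


def find_rule_id(response_body, user_agent):
    """Return the ID of the rule matching `user_agent`, or None.

    Assumes the response shape {"result": [{"id": ..., "configuration":
    {"target": "ua", "value": ...}}, ...]}.
    """
    payload = json.loads(response_body)
    for rule in payload.get("result") or []:
        if rule.get("configuration", {}).get("value") == user_agent:
            return rule["id"]
    return None
```

In practice I did this step once in Postman and copied the ID by hand, which is perfectly fine for a single rule.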

In the Advanced tab we add two custom headers. The first authenticates with the API token:

“Authorization” with the value “Bearer la_api_key_generated” (where la_api_key_generated stands for the token we generated).

The second indicates that the content is JSON:

“Content-Type” with the value “application/json”.

As the method we change GET to PUT, and in Request body we put the JSON data to send:

{
	"id": "id-de-la-regla",
	"description": "Bot de Pinterest",
	"paused": false,
	"mode": "challenge",
	"configuration": {
		"target": "ua",
		"value": "Mozilla/5.0 (compatible; Pinterestbot/1.0; +http://www.pinterest.com/bot.html)"
	}
}

With this request we activate the rule, and with another in which "paused": false becomes "paused": true we deactivate it again.
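If you would rather make the call from the server with cron instead of cron-job.org, the same PUT request can be assembled in Python. This is a minimal sketch, not what I actually run: build_toggle_request is a hypothetical helper, and the URL, headers and body simply mirror the ones configured in the cron jobs. The request is built but not sent; urllib.request.urlopen(request) would send it.

```python
import json
from urllib.request import Request

API_BASE = "https://api.cloudflare.com/client/v4"
PINTEREST_UA = (
    "Mozilla/5.0 (compatible; Pinterestbot/1.0; "
    "+http://www.pinterest.com/bot.html)"
)


def build_toggle_request(zone_id, rule_id, api_token, paused):
    """Build (without sending) the PUT request that pauses or resumes the rule.

    paused=False keeps the challenge active; paused=True lets Pinterest in.
    """
    body = {
        "id": rule_id,
        "description": "Bot de Pinterest",
        "paused": paused,
        "mode": "challenge",
        "configuration": {"target": "ua", "value": PINTEREST_UA},
    }
    return Request(
        f"{API_BASE}/zones/{zone_id}/firewall/ua_rules/{rule_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="PUT",
    )
```

The 8am job would correspond to build_toggle_request(..., paused=False) and the 10pm job to build_toggle_request(..., paused=True).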

Remember that if you manage several Cloudflare accounts through delegated access (in my case I manage several customer accounts from my own), the API token must be scoped to the client's account only, and within it to the specific domain only, with the Edit permission for Firewall Services.