I've noticed that scrapers from companies like OpenAI and Anthropic seem to ignore the rules in robots.txt. They also often hit event pages with bizarre, invalid dates like 2139-13-45, which puts a heavy load on my server. I'm looking for a straightforward way to deal with this. Heavier tools like mod_security feel dated and overly complicated for my needs, especially for smaller sites on shared hosting. For larger sites I've considered bunkerweb, but it's more involved than I'd hoped. Does anyone have lighter solutions or alternatives that could help?
5 Answers
Rate-limiting and honeypot pages that humans wouldn't trigger could also help. Any bot that hits those traps would get instantly blacklisted.
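To make that concrete, here is a minimal in-memory sketch of the idea in Python, not a production setup. The trap path, the limit of 30 requests per 60 seconds, and the function name `allow_request` are all assumptions for illustration; on a real site the trap URL would be disallowed in robots.txt and hidden from human visitors, and the state would need to live somewhere shared if there is more than one worker process.

```python
import time
from collections import defaultdict, deque

# Hypothetical trap URL: disallowed in robots.txt and never linked visibly.
HONEYPOT_PATHS = {"/trap/do-not-follow"}
MAX_REQUESTS = 30        # assumed limit per window
WINDOW_SECONDS = 60.0    # assumed window length

blacklist = set()              # IPs that touched a honeypot
recent = defaultdict(deque)    # IP -> timestamps of recently served requests


def allow_request(ip, path, now=None):
    """Return True if the request should be served, False if it is blocked."""
    now = time.monotonic() if now is None else now

    if ip in blacklist:
        return False

    # Any hit on a trap page blacklists the client immediately.
    if path in HONEYPOT_PATHS:
        blacklist.add(ip)
        return False

    # Sliding window: forget timestamps that have aged out.
    hits = recent[ip]
    while hits and now - hits[0] >= WINDOW_SECONDS:
        hits.popleft()

    if len(hits) >= MAX_REQUESTS:
        return False

    hits.append(now)
    return True
```

You would call `allow_request(client_ip, request_path)` at the top of your request handler and return a 403 or 429 when it comes back False.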
I've taken a different approach: I take their published IP ranges and redirect requests from them to a page that humorously highlights how awesome I am. It's a fun way to deal with them! Setting this up with Traefik was really easy.
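If you are not on Traefik, the matching step itself is small enough to do in application code. This is a rough Python sketch using only the standard `ipaddress` module; the CIDR blocks below are documentation-reserved placeholders, not any company's real ranges, and `is_known_scraper` is just a name picked for the example.

```python
import ipaddress

# Placeholder ranges (reserved for documentation). Replace them with the
# ranges the crawler operators actually publish.
SCRAPER_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "2001:db8::/32")
]


def is_known_scraper(client_ip):
    """Return True if client_ip falls inside one of the listed ranges."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        # Not a parseable IP address, so it cannot match a range.
        return False
    return any(
        addr.version == net.version and addr in net
        for net in SCRAPER_RANGES
    )
```

When `is_known_scraper(client_ip)` is True, the handler can answer with a redirect to whatever page you like instead of doing any real work.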
While it's frustrating that scrapers ignore crawl rules, if their access to invalid pages is stressing your server, it might point to underlying issues in your site's architecture. Accessing those non-existent agenda pages shouldn’t cause significant strain; usually, it’s just one database query. But logging 21k lines just for scrapers isn’t reasonable, and I get how annoying that can be.
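One cheap way to keep requests like 2139-13-45 from reaching the database at all is to validate the date first and return a 404 straight away. A minimal Python sketch, assuming the date arrives as a YYYY-MM-DD string; the five-year cutoff and the name `parse_event_date` are assumptions, so adjust them to however far your agenda really extends.

```python
from datetime import date


def parse_event_date(text, max_years_ahead=5, today=None):
    """Return a date for a plausible agenda URL, or None if it should 404."""
    today = today or date.today()
    try:
        # Raises ValueError for impossible dates such as month 13 or day 45.
        parsed = date.fromisoformat(text)
    except ValueError:
        return None
    # Valid but absurdly distant dates are rejected without a query too.
    if abs(parsed.year - today.year) > max_years_ahead:
        return None
    return parsed
```

If this returns None, answer with a 404 before touching the database or writing an application log line.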
I've deployed Anubis and I've been really happy with it! It does a great job of managing pesky web scrapers.
One solid option is Fail2ban. It's popular among DevOps teams and works well for blocking these scrapers.

Thanks! I'll definitely look into Anubis.