Addressing Scraping Challenges from AI Bots

Asked By TechWizard42 On

I've noticed that scrapers from companies like OpenAI and Anthropic seem to ignore the rules in robots.txt. They also often hit event pages with impossible dates like 2139-13-45, which puts a heavy load on my server. I'm looking for a straightforward way to deal with this. Heavier tools like mod_security feel dated and overly complicated for my needs, especially for smaller sites on shared hosting. For larger sites I've considered bunkerweb, but it's more involved than I'd hoped. Does anyone have lighter solutions or alternatives that could help?

5 Answers

Answered By SecureSiteGuard On

Rate-limiting and honeypot pages that humans wouldn't trigger could also help. Any bot that hits those traps would get instantly blacklisted.
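To make the idea concrete, here is a minimal in-memory sketch of a honeypot plus a sliding-window rate limit. Everything named here is an assumption for illustration: the `BotGuard` class, the trap path `/events/trap/`, and the limit of 20 requests per 60 seconds are all hypothetical. A real deployment would normally do this in the web server or a shared store rather than in one process.

```python
import time
from collections import defaultdict, deque

# Hypothetical trap URL: disallow it in robots.txt and never link it visibly,
# so only clients that ignore the rules should ever request it.
HONEYPOT_PATHS = {"/events/trap/"}
MAX_REQUESTS = 20       # assumed limit per window
WINDOW_SECONDS = 60.0   # assumed window length


class BotGuard:
    """Keeps a blocklist and a sliding window of request times per IP."""

    def __init__(self):
        self.blocked = set()
        self.hits = defaultdict(deque)

    def allow(self, ip, path, now=None):
        """Return True if the request should be served, False if refused."""
        now = time.monotonic() if now is None else now
        if ip in self.blocked:
            return False
        if path in HONEYPOT_PATHS:
            # Touching the trap blocks the client from then on.
            self.blocked.add(ip)
            return False
        window = self.hits[ip]
        # Drop timestamps that have fallen out of the window.
        while window and now - window[0] >= WINDOW_SECONDS:
            window.popleft()
        if len(window) >= MAX_REQUESTS:
            return False
        window.append(now)
        return True
```

A request handler would call `guard.allow(client_ip, request_path)` first and answer with 403 or 429 when it returns False. Note that the blocklist here lives only in memory and is lost on restart.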

Answered By ScraperSmasher On

I've taken a different approach: I match requests against their published IP ranges and redirect them to a page that humorously highlights how awesome I am. It's a fun way to deal with them! Setting this up with Traefik was really easy.
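For anyone not on Traefik, the matching step on its own might look like the sketch below. It is not the setup described above, just the general technique of checking a client address against a list of CIDR ranges. The ranges shown are reserved documentation blocks used as placeholders, not real crawler ranges, and `is_listed_crawler` is a hypothetical helper name.

```python
import ipaddress

# Placeholder CIDR blocks (reserved for documentation), NOT real crawler ranges.
# Swap in whatever ranges the crawler operator actually lists.
CRAWLER_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_listed_crawler(client_ip):
    """Return True if client_ip falls inside any configured range."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        # Not a parseable IP address, so it cannot match a range.
        return False
    return any(addr in network for network in CRAWLER_RANGES)
```

A handler could then redirect, block, or serve a static page whenever `is_listed_crawler(client_ip)` is True.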

Answered By ServerSleuth On

While it's frustrating that scrapers ignore crawl rules, if their access to invalid pages is stressing your server, it might point to underlying issues in your site's architecture. Accessing those non-existent agenda pages shouldn’t cause significant strain; usually, it’s just one database query. But logging 21k lines just for scrapers isn’t reasonable, and I get how annoying that can be.
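One way to make those bogus requests even cheaper is to reject impossible dates before touching the database at all. The sketch below assumes the date arrives as a `YYYY-MM-DD` string. The function names, the `lookup` callable standing in for the real query, and the accepted range of 2000 to 2100 are all assumptions for illustration.

```python
import re
from datetime import date

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")
# Assumed range of dates the site could plausibly have events for.
EARLIEST = date(2000, 1, 1)
LATEST = date(2100, 12, 31)


def parse_event_date(raw):
    """Return a date for a plausible YYYY-MM-DD string, or None otherwise."""
    if not DATE_PATTERN.fullmatch(raw):
        return None
    year, month, day = (int(part) for part in raw.split("-"))
    try:
        parsed = date(year, month, day)  # raises ValueError for month 13, day 45
    except ValueError:
        return None
    return parsed if EARLIEST <= parsed <= LATEST else None


def handle_event_request(raw_date, lookup):
    """Answer 404 for invalid dates without ever calling lookup (the DB query)."""
    day = parse_event_date(raw_date)
    if day is None:
        return 404, "Not found"
    return 200, lookup(day)
```

With this in place, a request for 2139-13-45 costs a regex match and one failed date construction, and the query never runs.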

Answered By NinjaCoder88 On

I've deployed Anubis and I've been really happy with it! It does a great job of managing pesky web scrapers.

TechWizard42 -

Thanks! I'll definitely look into Anubis.

Answered By DevOpsDude99 On

One solid option is Fail2ban. It's popular among DevOps teams and works well for blocking scrapers.
