I've noticed that sites such as gnu.org have been overwhelmed by AI crawlers, to the point that it affects their availability. AI companies already consume a significant amount of energy and resources, so it's frustrating to host a site for people while AI bots hog the bandwidth. Are there ways to make crawling costly for these bots without draining my own CPU or memory? Specifically, is there a way to hang a TCP connection so that the kernel doesn't have to spend CPU or memory on that socket, leaving the bot to time out on its end? I'm also looking for other budget-friendly tactics for dealing with these crawlers, and I'd like to know whether existing modules or WAF solutions already do this.
5 Answers
Instead of trying to drain resources from AI bots, why not serve them fake cached data? If you can spot the bots, it's pretty simple to set this up. They're probably too focused on gathering real data to notice they're being tricked.
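For anyone wondering what "spot the bots and serve them something else" could look like, here is a minimal sketch as a plain WSGI app. It only checks the User-Agent header, which well-behaved crawlers set but anyone can spoof, so treat it as a first filter rather than real detection. The token list, `DECOY_BODY`, and the function names are my own assumptions for illustration; build your list from your own access logs.

```python
import re

# Assumed list of crawler User-Agent tokens; extend it from your own logs.
AI_CRAWLER_TOKENS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")
_PATTERN = re.compile(
    "|".join(re.escape(token) for token in AI_CRAWLER_TOKENS),
    re.IGNORECASE,
)

# Hypothetical pre-rendered decoy page, kept in memory so serving it is cheap.
DECOY_BODY = b"<html><body><p>Nothing to see here.</p></body></html>"
REAL_BODY = b"<html><body><p>Real content.</p></body></html>"


def is_ai_crawler(user_agent):
    """Return True if the User-Agent contains one of the known tokens."""
    return bool(user_agent) and _PATTERN.search(user_agent) is not None


def app(environ, start_response):
    """WSGI entry point: decoy for matched crawlers, real page otherwise."""
    user_agent = environ.get("HTTP_USER_AGENT", "")
    body = DECOY_BODY if is_ai_crawler(user_agent) else REAL_BODY
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

In practice you would probably do the same match in the web server or reverse proxy config so the request never reaches the application at all.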
Cloudflare's AI Labyrinth feature is definitely worth a look for handling bots, since it's designed specifically for that purpose.
Fail2Ban could help too, although it needs regular monitoring to keep up with the bots. Updating your filters every time a new crawler shows up can turn into a fair amount of work.
You might want to check out Nepenthes or Cloudflare’s AI Labyrinth. Both work as tarpits: rather than blocking a crawler outright, they lead it into a maze of generated decoy pages, so it wastes its time there instead of on your real content.
I’m curious, are there any specific implementations for those tools?
Another interesting tactic is using a Markov Babbler. You can generate random text and mix it into your content to confuse the bots. For instance, take public domain books and slightly alter them to disrupt the crawling process. This could potentially hurt their datasets.
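If it helps, here is a minimal sketch of a word-level Markov babbler in plain Python. It assumes you feed it a long text of your own choosing, for example a public domain book. The function names and the `order` and `seed` parameters are just my own for illustration, not taken from any existing tool.

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each run of `order` words to the words seen right after it."""
    words = text.split()
    if len(words) <= order:
        raise ValueError("text is too short for the chosen order")
    chain = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain[key].append(words[i + order])
    return dict(chain)


def babble(chain, length=50, seed=None):
    """Generate `length` words by walking the chain at random."""
    rng = random.Random(seed)
    keys = list(chain)
    order = len(keys[0])
    out = list(rng.choice(keys))
    while len(out) < length:
        followers = chain.get(tuple(out[-order:]))
        if followers:
            out.append(rng.choice(followers))
        else:
            # Dead end (the source text ended here): restart from a random key.
            out.extend(rng.choice(keys))
    return " ".join(out[:length])
```

Something like `babble(build_chain(book_text), length=300)` gives you a paragraph that looks plausible locally but means nothing. Generating it is cheap, and you can pre-generate and cache pages so it costs you almost nothing per request.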

But what about the smaller AI companies? They might not have infinite funds, and coordinated DDoS defenses could really hurt them.