I've noticed that GPTBot has been scraping the same page on my website non-stop for about a day now. I use a URL with a hashed return URL as a path parameter, which results in a lot of unique URLs all pointing to the same content. It seems like GPTBot doesn't honor canonical tags yet, so it's getting stuck in a loop. I tried throttling its requests to one every three seconds, but even at that rate the load was still overwhelming. It's starting to feel like harassment! I'm curious how others are managing this situation.
2 Answers
One way to deal with this is to block GPTBot's IP address or IP range on ports 80 and 443 at your firewall. That stops its requests before they ever reach your web server.
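If you can't touch the firewall, the same check can be done in the application. Here's a rough sketch in Python, assuming you already have a list of CIDR ranges for the crawler. The ranges below are placeholders from the reserved documentation blocks, not GPTBot's real addresses, and `is_blocked` is just a name made up for this example.

```python
import ipaddress

# Placeholder ranges from the reserved documentation blocks (RFC 5737).
# Assumption: swap these for the ranges the crawler operator publishes.
BLOCKED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/28")
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_RANGES)
```

You'd call `is_blocked(request_ip)` early in your request handling and return a 403 on a match. A firewall rule is still cheaper, since the request never reaches your app at all.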
At my company, we actually want to allow GPTBot since we get decent traffic from ChatGPT. We set up a Cloudflare rule to cache everything for requests that identify as GPTBot. It also removes all query parameters from the cache key. This cut our server load almost immediately! We even extended that rule to include all bots, and now our servers can handle human traffic way better!
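To show what dropping query parameters from the cache key does, here's a minimal Python sketch of the idea. It is not the actual Cloudflare rule, which is configured in their dashboard. The `cache_key` function and the `BOT_MARKERS` tuple are hypothetical names for this illustration, and matching on the user agent string is an assumption about how the bot identifies itself.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: bots are recognized by a substring of the User-Agent header.
BOT_MARKERS = ("gptbot",)


def cache_key(url: str, user_agent: str) -> str:
    """Build a cache key; bot requests ignore query string and fragment."""
    parts = urlsplit(url)
    if any(marker in user_agent.lower() for marker in BOT_MARKERS):
        # Every query variant of a page collapses to one cache entry.
        parts = parts._replace(query="", fragment="")
    return urlunsplit(parts)
```

With this, `https://example.com/page?ret=abc` and `https://example.com/page?ret=xyz` share one cached response for the bot, while human visitors keep distinct keys. Note that this only covers query parameters; a hash embedded in the path would need its own normalization step.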
This is super helpful, thanks! I’ll definitely look into this.
Yeah, I thought about that too, but I have to admit that tarpitting it sounds way more fun!