I'm running a non-commercial recipe scraper called dishTXT.lol, and I keep hitting persistent 503 errors and soft rate limiting despite trying to be very conservative with my requests. I've implemented per-domain throttling of 2 seconds, aggressive caching in Cloudflare's D1, and I rotate user agents and headers. When I do get blocked, I fall back to ScraperAPI as a last resort. I'm starting to wonder whether I'm missing something crucial (Cloudflare Workers quirks, IP reputation, fetch behavior) or whether this is just par for the course for web scraping in 2026. I'd really appreciate insights from anyone who's faced similar challenges at scale.
4 Answers
If you’re not already, consider posting about this in the web scraping community too; they might have more specific insights for your setup.
It sounds like your per-domain delay is fine, but hidden concurrency might be the real problem. I faced a similar issue on a client project. Try limiting your in-flight fetches to one per origin and queueing the rest. If the 503s stop, the throttling was likely triggered by concurrent connections at the infrastructure level rather than by your overall crawl rate.
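A minimal sketch of that per-origin queue, assuming a plain Workers fetch handler (the names here are mine, not from dishTXT's code):

```typescript
// Chain each fetch onto the previous one for the same origin, so at most one
// request per origin is in flight at a time. The map only lives within a single
// Worker isolate, so this is per-isolate serialization, not global.
const inFlight = new Map<string, Promise<unknown>>();

async function fetchOnePerOrigin(url: string, init?: RequestInit): Promise<Response> {
  const origin = new URL(url).origin;
  const previous = inFlight.get(origin) ?? Promise.resolve();
  // Swallow errors from the previous request so one failure doesn't wedge the queue.
  const task = previous.catch(() => {}).then(() => fetch(url, init));
  inFlight.set(origin, task);
  return task;
}
```

If the 503s disappear once requests are serialized like this, you know bursts of parallel connections were the trigger, and you can then relax the limit carefully.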
It's good that you're being cautious, but make sure you're actually respecting each site's robots.txt. Many of these sites sit behind Cloudflare themselves, so you may be tripping their rate limits without realizing it, and your requests may be watched more closely than you think. It's not just about conservative request patterns; several factors can be driving these 503 errors.
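If you want to read a site's Crawl-delay for your bot explicitly, a rough sketch looks like this; the bot name and the simplified group handling are assumptions, not a complete robots.txt parser:

```typescript
// Pull the Crawl-delay that applies to our user agent from a site's robots.txt,
// falling back to the "*" group. Returns null if nothing applicable is found.
async function getCrawlDelaySeconds(origin: string, botName = "dishtxt"): Promise<number | null> {
  const res = await fetch(new URL("/robots.txt", origin));
  if (!res.ok) return null;

  let groupAgents: string[] = [];
  let inDirectives = false;
  let starDelay: number | null = null;
  let ourDelay: number | null = null;

  for (const rawLine of (await res.text()).split("\n")) {
    const line = rawLine.split("#")[0].trim(); // strip comments
    const colon = line.indexOf(":");
    if (colon === -1) continue;
    const key = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (!value) continue;

    if (key === "user-agent") {
      // A user-agent line that follows directives starts a new group.
      if (inDirectives) { groupAgents = []; inDirectives = false; }
      groupAgents.push(value.toLowerCase());
    } else {
      inDirectives = true;
      if (key === "crawl-delay") {
        const delay = parseFloat(value);
        if (!Number.isNaN(delay)) {
          if (groupAgents.includes("*")) starDelay = delay;
          if (groupAgents.some(a => a.includes(botName.toLowerCase()))) ourDelay = delay;
        }
      }
    }
  }
  return ourDelay ?? starDelay;
}
```

You'd then take the larger of this value and your own 2-second floor when scheduling requests for that domain.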

I'm actually sticking to the robots.txt rules and honoring the specified crawl delays. When I get 429 or 5xx responses, I back off significantly, and I don't try to bypass Cloudflare protections at all. The intermittent 503s I'm facing look more like infrastructure throttling than outright blocks.
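For reference, the backoff is roughly this shape (a simplified sketch; the base delay, cap, and jitter here are illustrative rather than my exact numbers):

```typescript
// Retry 429/5xx with exponential backoff plus jitter, preferring the server's
// own Retry-After hint (in seconds) when it sends one.
async function fetchWithBackoff(url: string, init?: RequestInit, maxRetries = 4): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url, init);
    if (res.status !== 429 && res.status < 500) return res; // success or non-retryable
    if (attempt >= maxRetries) return res;                   // give up, surface the last response

    const retryAfter = Number(res.headers.get("Retry-After"));
    const baseMs = Number.isFinite(retryAfter) && retryAfter > 0
      ? retryAfter * 1000
      : Math.min(2000 * 2 ** attempt, 60_000);
    const jitter = Math.random() * 500;
    await new Promise(resolve => setTimeout(resolve, baseMs + jitter));
  }
}
```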