I'm collecting reviews for about 500,000 hotels, but I'm hitting a wall when trying to scrape data. While it works when I run the scraper locally, moving this to an EC2 instance results in getting blocked really fast. I don't need the data instantly; updating every month or so is fine. I'm not keen on using residential proxies due to costs and complexity. I'm looking for practical ways to scrape this amount without getting blocked using mainly open-source tools. Are there specific strategies that work better in cloud setups like EC2? Any architectural ideas, such as batching or distributed scraping, would be really helpful!
4 Answers
Data center IPs are usually blocked outright. You could try a proxy service or a crawling API like Firecrawl to mask your requests.
Consider running Tailscale on both your home network and your EC2 instance, and use your home network as an exit node so the scraping traffic appears to come from your home IP. It's cost-effective; home IPs may hit some rate limits, but they're rarely blocked outright because ISPs reassign them frequently.
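A rough sketch of the exit-node setup, assuming Tailscale is already installed and logged in on both machines (the node name below is a placeholder):

```shell
# On the home machine: advertise it as an exit node
sudo tailscale up --advertise-exit-node
# (then approve the exit node in the Tailscale admin console)

# On the EC2 instance: route all traffic through the home node
sudo tailscale up --exit-node=home-machine
```

After this, outbound requests from the EC2 instance egress via your residential connection rather than an AWS IP range.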
From my experience, getting blocked on EC2 is mainly an IP-reputation problem. Home IPs are generally clean and won't get flagged quickly, whereas AWS IP ranges are publicly known, frequently abused by scrapers, and often blocked on sight. A few options to consider on EC2:
- Rotate through multiple EC2 instances across different regions or use spot instances to change your IP frequently.
- Incorporate random delays of 10-30 seconds between requests, along with realistic browser headers and slow scrolling actions.
- Leverage tools like Selenium with undetected-chromedriver or Playwright to mimic real user behavior.
- Implement batch scraping, targeting 5,000-10,000 hotels per day, instead of overwhelming the server.
This slower approach can work well since you're okay with not scraping everything at once; plenty of large hotel scraping projects run this way without expensive proxies. Have you implemented random user-agents or delays yet?
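A minimal sketch of the delay/header/batching ideas above (helper names like `browser_headers` and `batches` are my own, not from any particular library, and the user-agent strings are examples):

```python
import random
import time
from itertools import islice

# Example desktop user-agent strings; rotate a realistic, current set.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
]

def browser_headers():
    """Return browser-like headers with a randomly chosen User-Agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }

def batches(items, size):
    """Yield successive chunks of `size` items, e.g. one day's 5,000-hotel quota."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk

def scrape_batch(urls, fetch):
    """Fetch each URL with a random 10-30 s pause between requests.

    `fetch` is any callable like requests.get(url, headers=...).
    """
    for url in urls:
        yield fetch(url, headers=browser_headers())
        time.sleep(random.uniform(10, 30))
```

At ~5,000 hotels per day with these delays, a monthly refresh of 500,000 hotels needs roughly 100 parallel-days of work, so you'd still split the workload across a handful of instances or run several workers per instance.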
If you're open to using proxies, you might want to look into dedicated services like Bright Data or similar providers to help handle the scaling.