How Can I Scale Web Scraping for 500k Hotel Reviews Without Getting Blocked?

Asked By TrickyBiscuit42 On

I'm collecting reviews for about 500,000 hotels, but I'm hitting a wall with the scraping. The scraper works when I run it locally, but as soon as I move it to an EC2 instance it gets blocked very quickly. I don't need the data instantly; refreshing it every month or so is fine. I'd rather not use residential proxies because of the cost and complexity. I'm looking for practical ways to scrape this volume without getting blocked, using mainly open-source tools. Are there specific strategies that work better in cloud setups like EC2? Any architectural ideas, such as batching or distributed scraping, would be really helpful!

4 Answers

Answered By HomeNetSurfer On

Data center IPs are often blocked outright. You could try a proxy service or something like Firecrawl for better results, since either can help mask where your requests come from.

Answered By CloudWhiz21 On

Consider running Tailscale on a machine in your home network and on your EC2 instance. You can then use the home machine as an exit node, so the scraping traffic appears to come from your home IP. It's cost-effective, and while home IPs may still hit rate limits, they generally aren't blocked outright because ISPs reassign them frequently.

Answered By CleverCoder99 On

In my experience, getting blocked on EC2 mainly comes down to IP reputation. Home IPs are generally clean and won't get flagged as quickly, whereas AWS IP ranges are heavily abused by scrapers and tend to be blocked right away. A few options to consider on EC2:

- Rotate through multiple EC2 instances across different regions or use spot instances to change your IP frequently.
- Incorporate random delays of 10-30 seconds between requests, along with realistic browser headers and slow scrolling actions.
- Leverage tools like Selenium with undetected-chromedriver or Playwright to mimic real user behavior.
- Implement batch scraping, targeting 5,000-10,000 hotels per day, instead of overwhelming the server.
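The batching and random-delay bullets above could be sketched roughly as follows. This is only an illustration, not a tested scraper: `daily_batches`, `scrape_batch`, and the `fetch` callable are hypothetical names, and `fetch` stands in for whatever Selenium or Playwright routine actually loads a hotel page.

```python
import random
import time
from typing import Callable, Dict, Iterator, List


def daily_batches(hotel_ids: List[str], batch_size: int) -> Iterator[List[str]]:
    """Split the full ID list into consecutive batches, e.g. one per day."""
    for start in range(0, len(hotel_ids), batch_size):
        yield hotel_ids[start:start + batch_size]


def scrape_batch(
    batch: List[str],
    fetch: Callable[[str], str],          # hypothetical: loads one hotel's reviews
    min_delay_s: float = 10.0,            # delay range taken from the answer above
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch each hotel in the batch, pausing a random interval between requests."""
    results: Dict[str, str] = {}
    for i, hotel_id in enumerate(batch):
        if i > 0:
            # Jittered pause so requests do not arrive at a fixed rhythm.
            sleep(random.uniform(min_delay_s, max_delay_s))
        results[hotel_id] = fetch(hotel_id)
    return results


if __name__ == "__main__":
    ids = [f"hotel-{n}" for n in range(10)]   # placeholder IDs
    for batch in daily_batches(ids, batch_size=4):
        # No-op sleep and fake fetch so the demo finishes instantly.
        out = scrape_batch(batch, fetch=lambda h: f"<html for {h}>", sleep=lambda s: None)
        print(len(out), "hotels scraped in this batch")
```

Passing `sleep` and `fetch` in as parameters keeps the pacing logic separate from the browser automation, so the same loop can be dry-run without touching the target site.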

This slower approach can work well since you're okay with not scraping everything at once; many people run large hotel-scraping projects like this without expensive proxies. Have you implemented random user-agents or delays yet?
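As a rough sanity check on these numbers against the monthly refresh mentioned in the question, the arithmetic can be worked through as below. The function names are hypothetical, and the sketch assumes one request per hotel and ignores page-load time, so real figures would likely be worse (paginated reviews need several requests per hotel).

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def days_to_finish(total_hotels: int, hotels_per_day: int) -> int:
    """Days needed to cover every hotel once at a fixed daily batch size."""
    return math.ceil(total_hotels / hotels_per_day)


def workers_needed(
    total_hotels: int,
    window_days: int,
    min_delay_s: float,
    max_delay_s: float,
    requests_per_hotel: int = 1,   # assumption: one page load per hotel
) -> int:
    """Parallel workers (each with its own IP) needed to finish inside the window."""
    mean_delay_s = (min_delay_s + max_delay_s) / 2
    per_worker_per_day = SECONDS_PER_DAY / mean_delay_s
    required_per_day = total_hotels * requests_per_hotel / window_days
    return math.ceil(required_per_day / per_worker_per_day)


print(days_to_finish(500_000, 5_000))        # 100
print(days_to_finish(500_000, 10_000))       # 50
print(workers_needed(500_000, 30, 10, 30))   # 4
```

Under these assumptions, 5,000-10,000 hotels per day takes 50-100 days for a full pass, so a strict monthly refresh would need roughly 16,700 hotels per day, or about four workers each pausing 10-30 seconds between requests.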

Answered By ProxyPal88 On

If you're open to using proxies, you might want to look into dedicated services like Bright Data or similar providers to help handle the scaling.
