I'm diving into web scraping and have a proof of concept working locally with Python and Selenium. It takes me about 2-3 minutes per request, and I'm considering moving the process to AWS Lambda to run bigger operations without manually triggering the script multiple times. However, I'm concerned about getting my IP banned since the site I'm scraping uses Cloudflare. Does anyone have experience with this? Are free proxies a viable option, or are they likely blocked too? Also, I'd like to know how much it would cost to run several Lambda functions in parallel to scrape data once a day.
2 Answers
You'll probably get blocked right away. Lambda functions run from AWS IP ranges, which security tools on many sites flag as data center IPs. I've had scraping issues with AWS in the past, so I'm a bit skeptical about Lambda for this. Consider a residential proxy service, which could reduce the risk of getting blocked. I'm not sure what the costs are, though.
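If you do go the proxy route, here's a rough sketch of pointing Selenium's Chrome at a proxy. The helper names and the proxy address are made up for illustration, and I'm assuming Chrome plus Selenium 4. As far as I know, Chrome's `--proxy-server` flag does not accept a username and password in the URL, so a provider that requires credentials would need a different setup (IP allowlisting, for example).

```python
def build_chrome_args(proxy_url=None, headless=True):
    """Return the Chrome command-line flags for a scraping session.

    proxy_url is a hypothetical value such as "http://proxy.example.com:8080".
    """
    args = []
    if headless:
        # Newer Chrome headless mode; drop this flag to watch the browser locally.
        args.append("--headless=new")
    if proxy_url:
        # Route all browser traffic through the given proxy.
        args.append(f"--proxy-server={proxy_url}")
    return args


def make_driver(proxy_url=None):
    """Create a Chrome WebDriver that uses the flags built above."""
    # Imported here so build_chrome_args can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in build_chrome_args(proxy_url):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)


if __name__ == "__main__":
    # Placeholder address; substitute whatever your proxy provider gives you.
    print(build_chrome_args("http://proxy.example.com:8080"))
```

A proxy only changes the IP the site sees. It won't necessarily get you past Cloudflare's browser checks, so test it on a handful of requests before scaling up.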
Yeah, you'll definitely need some sort of proxy solution. Lambda functions use AWS IPs that many sites have blacklisted. For your cost question, use the AWS Pricing Calculator. I believe that if you stay within the free tier you might not incur any Lambda charges, depending on your request volume and how long each run takes.
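For a ballpark before you open the calculator, here's a back-of-the-envelope sketch. The prices and free-tier allowances are assumptions based on the published x86 rates as I remember them (about $0.0000166667 per GB-second, $0.20 per million requests, with 400,000 GB-seconds and 1 million requests free per month), so check them against the current pricing page. The function name and the 2048 MB memory setting are just illustrative, and proxy fees and data transfer are not included.

```python
def estimate_monthly_lambda_cost(
    invocations_per_day,
    seconds_per_invocation,
    memory_mb,
    days=30,
    price_per_gb_second=0.0000166667,   # assumed rate, verify before relying on it
    price_per_million_requests=0.20,    # assumed rate, verify before relying on it
    free_gb_seconds=400_000,            # assumed monthly free-tier allowance
    free_requests=1_000_000,            # assumed monthly free-tier allowance
):
    """Rough monthly Lambda bill in USD for a once-a-day scraping batch."""
    requests = invocations_per_day * days
    # Lambda bills compute as duration multiplied by configured memory in GB.
    gb_seconds = requests * seconds_per_invocation * (memory_mb / 1024)

    # Only usage above the free allowances is billed.
    billable_gb_seconds = max(0.0, gb_seconds - free_gb_seconds)
    billable_requests = max(0, requests - free_requests)

    compute_cost = billable_gb_seconds * price_per_gb_second
    request_cost = billable_requests / 1_000_000 * price_per_million_requests
    return round(compute_cost + request_cost, 2)


if __name__ == "__main__":
    # 3-minute runs (the upper end of the 2-3 minutes mentioned) at 2048 MB.
    for per_day in (10, 100, 500):
        cost = estimate_monthly_lambda_cost(per_day, 180, 2048)
        print(f"{per_day} runs/day: ~${cost}/month")
```

Under those assumptions, a few dozen 3-minute runs a day stay inside the free allowance, and the cost grows roughly linearly after that. Keep in mind that a single Lambda invocation is capped at 15 minutes, so longer jobs would need to be split up.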
Any recommendations for proxies that are less likely to be blocked?