I'm looking to do some web scraping on a specific site, and I already have a proof of concept that works fine with Python and Selenium on my local machine. The scraping takes about 2-3 minutes per request, and I want to scale this up without having to manually run the script multiple times.

I'm considering AWS Lambda for this, but I'm concerned about potential IP bans since the site I'm targeting uses Cloudflare. I've heard free proxies might not work either, since they tend to be blocked already.

I also want to know roughly how much it would cost to run multiple Lambda functions that scrape data once a day.
5 Answers
I suggest checking out the Zyte API. It's built specifically for web scraping and handles most of the problems you'd otherwise deal with yourself (IP bans, proxy rotation, browser rendering), and their pricing is pretty reasonable.
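To give a concrete picture, here's a minimal sketch of calling it from Python with requests. The endpoint, auth scheme, and field names are from memory of their docs, so treat them as assumptions and verify against the current Zyte API reference:

```python
import requests

API_KEY = "your-zyte-api-key"            # assumption: the key from your Zyte dashboard
TARGET_URL = "https://example.com/page"  # placeholder for the site you're scraping

# The API key goes in as the Basic-auth username with an empty password.
# "browserHtml" asks their service to render the page in a real browser,
# which is what deals with most Cloudflare-style challenges for you.
resp = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=(API_KEY, ""),
    json={"url": TARGET_URL, "browserHtml": True},
    timeout=120,
)
resp.raise_for_status()
html = resp.json()["browserHtml"]
print(html[:500])
```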
It really depends on the site and how its bot protection is set up. Lambda does give you varying IPs across invocations, but they're all categorized as data-center IPs, so sites that ban those ranges will still block you.
Using a proxy is a good idea here. Just remember that AWS Lambda runs from AWS's own IP ranges, which are banned on a lot of sites. As for costs, the AWS pricing calculator will give you a rough estimate, but for a once-a-day job your usage may well fit inside the Lambda free tier (currently 1M requests and 400,000 GB-seconds of compute per month).
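As a back-of-the-envelope check, here's the arithmetic for one 3-minute run per day. The 2048 MB memory size and the per-GB-second rate are assumptions, so swap in your own region's numbers from the pricing page:

```python
# Rough Lambda cost estimate for once-a-day scraping.
# Assumptions (check current AWS pricing for your region):
#   - 2048 MB memory, since headless Chrome needs some headroom
#   - 180 s per invocation (the 2-3 minutes mentioned in the question)
#   - ~$0.0000166667 per GB-second and $0.20 per 1M requests (x86)
memory_gb = 2048 / 1024
duration_s = 180
invocations_per_month = 30

gb_seconds = memory_gb * duration_s * invocations_per_month   # 10,800 GB-s
compute_cost = gb_seconds * 0.0000166667
request_cost = invocations_per_month / 1_000_000 * 0.20

print(f"{gb_seconds:,.0f} GB-seconds/month -> ${compute_cost + request_cost:.2f}")
# The free tier currently covers 400,000 GB-seconds and 1M requests per month,
# so a single daily scrape like this would normally cost nothing.
```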
You're likely to get blocked, since the IPs used by Lambda are flagged as data-center IPs. I've run into this with scraping on AWS before, although not with Lambda specifically. You might want to look into residential proxy services, which are built for scraping, but I'm not sure about the costs involved.
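If you do go the residential-proxy route, plugging it into your existing Selenium script is usually just a Chrome flag. The gateway host and port below are placeholders for whatever your provider gives you:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Placeholder endpoint; residential providers typically hand you a gateway
# host/port, with credentials handled via the provider's dashboard or
# IP whitelisting.
PROXY = "http://proxy.example-provider.com:8000"

options = Options()
options.add_argument("--headless=new")           # needed in Lambda-like environments
options.add_argument(f"--proxy-server={PROXY}")  # route all browser traffic via the proxy

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://httpbin.org/ip")  # quick sanity check that the proxy IP is in use
    print(driver.page_source)
finally:
    driver.quit()
```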
I've used AWS for smaller projects and it works fine for one-off scrapes. Just make sure to adjust your headers and browser settings (a realistic user agent, hiding the automation flags, as in the snippet below) to lower the chance of getting caught, though it's not ideal for large-scale scraping.
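For example, with Selenium and Chrome those tweaks look roughly like this; the user-agent string is just an example, and how much it helps depends on the site:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Use a realistic desktop user agent instead of the headless-Chrome default.
options.add_argument(
    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
# Common tweaks that make the browser look less like an automated session.
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")
    print(driver.title)
finally:
    driver.quit()
```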