I created a web scraper for airbnb.com and kiwi.com using Python and Playwright. It successfully runs on my local machine, but when I try to deploy it on GitHub Actions, it triggers a bot detection mechanism. I switched to using playwright_stealth and changed the user agent, which allowed access, but now some elements are still missing or broken. Can anyone offer advice on how to tackle this issue?
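For context, here is a minimal sketch of the kind of setup the question describes (playwright_stealth plus a custom user agent). This is not the asker's actual code; the user-agent string, viewport, and target URL are placeholder assumptions, and it assumes the playwright-stealth package's stealth_sync helper.

```python
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync  # assumes the playwright-stealth package is installed

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        # Placeholder user-agent string; any realistic desktop UA would do here.
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        ),
        viewport={"width": 1366, "height": 768},
    )
    page = context.new_page()
    stealth_sync(page)  # patches common automation fingerprints (navigator.webdriver, etc.)
    page.goto("https://www.kiwi.com/", wait_until="networkidle")  # placeholder target
    print(page.title())
    browser.close()
```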
3 Answers
The issue likely stems from IP trust levels. When the scraper runs on your local machine, it uses your residential IP, which carries a good trust score; once deployed to GitHub Actions, it runs from a datacenter IP that anti-bot systems trust far less, which leads to blocks or degraded pages. Consider testing with a proxy locally first to confirm the IP is the cause, and set a geolocation/locale that matches the content you're targeting (see the sketch below). That should help ensure the server gets the same data you see locally.
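A rough sketch of what that could look like in Playwright: routing the browser through a proxy and pinning geolocation, locale, and timezone so a local run mimics the deployed environment. The proxy endpoint, credentials, and coordinates below are placeholders, not values from the question.

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        proxy={
            "server": "http://proxy.example.com:8000",  # placeholder proxy endpoint
            "username": "user",                         # placeholder credentials
            "password": "pass",
        },
    )
    context = browser.new_context(
        locale="en-US",
        timezone_id="America/New_York",
        geolocation={"latitude": 40.7128, "longitude": -74.0060},  # example: New York
        permissions=["geolocation"],
    )
    page = context.new_page()
    page.goto("https://www.airbnb.com/")
    print(page.title())
    browser.close()
```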
It could be an IP issue, which is kind of out of your hands. When moving from local to a server, the switch in IP could definitely be triggering those bot protections and causing missing elements.
Check out the project 'crawl4ai' on GitHub. It might give you some useful insights or tools to enhance your scraping script!
Thanks for the suggestion! My goal was to build a scraper from the ground up to really understand the process better. Getting HTML blocks into an LLM isn't that hard, but since mine works fine locally, I want to pinpoint what's going wrong when I deploy.
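One way to pinpoint what differs in the deployed run (a sketch, not the asker's code) is to dump a screenshot, the rendered HTML, and a Playwright trace on each CI run, then upload the debug/ folder as a workflow artifact (e.g. with actions/upload-artifact) and compare it against a local run.

```python
from pathlib import Path
from playwright.sync_api import sync_playwright

Path("debug").mkdir(exist_ok=True)

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()
    context.tracing.start(screenshots=True, snapshots=True)
    page = context.new_page()
    page.goto("https://www.kiwi.com/", wait_until="networkidle")  # placeholder target

    # Capture what the page actually looked like in CI.
    page.screenshot(path="debug/page.png", full_page=True)
    with open("debug/page.html", "w", encoding="utf-8") as f:
        f.write(page.content())
    context.tracing.stop(path="debug/trace.zip")  # inspect with `playwright show-trace debug/trace.zip`
    browser.close()
```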