Trouble with Web Scraping on GitHub Actions

Asked By CraftyOtter92 On

I built a web scraper for airbnb.com and kiwi.com using Python and Playwright. It runs fine on my local machine, but when I deploy it to GitHub Actions it trips a bot-detection mechanism. I switched to playwright_stealth and changed the user agent, which got me past the block, but now some page elements are still missing or broken. Can anyone offer advice on how to tackle this?
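For reference, a minimal sketch of the setup described above: Playwright with playwright_stealth and a custom user agent. The user-agent string, viewport, and target URL here are illustrative placeholders, not values from the question, and the `stealth_sync` call assumes the classic playwright_stealth API.

```python
# Sketch of a stealth Playwright scraper (assumed setup, not the asker's code).

# A common desktop Chrome user-agent string (illustrative placeholder).
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

def context_options() -> dict:
    """Browser-context options passed to Browser.new_context()."""
    return {
        "user_agent": USER_AGENT,
        "viewport": {"width": 1366, "height": 768},
        "locale": "en-US",
    }

def scrape(url: str) -> str:
    # Imported lazily so the option helpers above can be used/inspected
    # even where the browser dependencies are not installed.
    from playwright.sync_api import sync_playwright
    from playwright_stealth import stealth_sync

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(**context_options())
        page = context.new_page()
        stealth_sync(page)  # patch navigator.webdriver etc. before navigating
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
        return html

if __name__ == "__main__":
    print(scrape("https://example.com")[:200])
```

Note that even with this setup, fingerprinting goes beyond the user agent (headless flags, WebGL, fonts), which is why stealth patches alone may not be enough.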

3 Answers

Answered By DevGuru42 On

The issue likely stems from IP trust levels. On your local machine the scraper goes out through your residential IP, which has a good trust score. Once deployed, it runs from a datacenter IP that sites trust far less, which leads to blocks and partially rendered pages. Try testing with proxies locally, ideally one whose geolocation matches the target content, so you can confirm you're getting the same data the deployed scraper should see.
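The proxy suggestion above can be sketched with Playwright's built-in `proxy` launch option. The proxy address and credentials below are placeholders; you'd substitute a real (ideally residential) proxy endpoint.

```python
# Sketch: routing a Playwright scraper through a proxy (placeholder values).

def proxy_settings(server: str, username: str = "", password: str = "") -> dict:
    """Build the dict that Playwright's launch(proxy=...) expects."""
    settings = {"server": server}
    if username:
        settings["username"] = username
        settings["password"] = password
    return settings

def fetch_via_proxy(url: str, proxy: dict) -> str:
    # Lazy import so the config helper works without playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, proxy=proxy)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
        return html

if __name__ == "__main__":
    # proxy.example.com is a hypothetical endpoint, not a working proxy.
    cfg = proxy_settings("http://proxy.example.com:3128", "user", "pass")
    print(fetch_via_proxy("https://example.com", cfg)[:200])
```

Running the same script locally with and without the proxy is a cheap way to confirm whether the missing elements are IP-related before touching the GitHub Actions workflow.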

Answered By ScrapingWhiz On

It could be an IP issue, which is somewhat out of your hands. Moving from local to a server changes your IP, and that switch could definitely be triggering those bot protections and causing the missing elements.

Answered By CodeNinja88 On

Check out the project 'crawl4ai' on GitHub. It might give you some useful insights or tools to enhance your scraping script!

CraftyOtter92 -

Thanks for the suggestion! My goal was to build a scraper from the ground up to really understand the process better. Getting HTML blocks into an LLM isn't that hard, but since mine works fine locally, I want to pinpoint what's going wrong when I deploy.
