I've built a website that generates summaries of privacy policies, helping users understand what data apps collect and sell. Right now, I'm manually collecting the URLs for these policies, which is time-consuming and limiting. I'm looking to automate this with web scraping so users can quickly look up any app. I'm considering tools like Scrapy or ParseHub, but I'm unsure whether they can consistently find the right URLs. Are these the best options, or are there other tools I should consider?
1 Answer
Both Scrapy and ParseHub are solid picks, but they cater to different needs. Scrapy is a powerhouse if you're comfortable with Python; it gives you flexibility and control for large-scale crawls and handles tasks like pagination well. For JavaScript-heavy pages you'd typically pair it with a headless-browser integration, since it doesn't render pages on its own.
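To make that concrete, here's a rough sketch of what a Scrapy spider for your use case could look like. The start URL, the "privacy" link-text filter, and the pagination selector are all placeholders you'd adapt to whichever listing pages you crawl:

```python
import scrapy


class PrivacyPolicySpider(scrapy.Spider):
    name = "privacy_policy"
    # Placeholder: replace with the app listing pages you actually crawl.
    start_urls = ["https://example.com/apps"]

    def parse(self, response):
        # Collect links whose visible text mentions a privacy policy.
        for link in response.css("a"):
            text = (link.css("::text").get() or "").strip().lower()
            if "privacy" in text:
                yield {
                    "app_page": response.url,
                    "policy_url": response.urljoin(link.attrib.get("href", "")),
                }

        # Follow pagination if the listing spans multiple pages
        # (the "a.next" selector is a guess -- adjust it to the real markup).
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```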
ParseHub, on the other hand, is a good fit if you prefer a visual, point-and-click approach and don't want to write code. It handles simpler jobs and dynamic sites well, but it won't scale as smoothly as Scrapy for larger crawls. Given your goals, Scrapy is probably the better bet for reliability and scalability as your app grows. Just keep in mind that scrapers break when site structures change, so build in error handling from the start.
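On that last point, one pattern is to attach an errback to every request and log whenever the expected selector comes back empty, so a structure change doesn't silently produce nothing. A rough sketch, again with placeholder URLs and selectors:

```python
import scrapy


class RobustPolicySpider(scrapy.Spider):
    name = "robust_policy"
    start_urls = ["https://example.com/apps"]  # placeholder

    def start_requests(self):
        for url in self.start_urls:
            # errback catches DNS failures, timeouts, non-2xx responses, etc.
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        links = response.css("a[href*='privacy']::attr(href)").getall()
        if not links:
            # The page loaded but the expected markup wasn't there,
            # which often means the site's structure changed.
            self.logger.warning("No privacy-policy links found on %s", response.url)
            return
        for href in links:
            yield {"app_page": response.url, "policy_url": response.urljoin(href)}

    def on_error(self, failure):
        # Log failed requests so they can be reviewed or retried later.
        self.logger.error(repr(failure))
```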
Thanks for the insights! What about testing? Is there an automated way to check that I'm grabbing the right info, or is that just not reliable?