I'm working on an automation project that fetches the latest articles from several news and blog websites, ideally once a day. My goals are to gather newly published content, eliminate duplicates, and keep the system reliable even when site structures change. I've explored RSS feeds (though not all sites provide reliable or complete feeds), web scraping with tools like Puppeteer or Cheerio, and whatever APIs are available. I'm looking for advice from anyone who's implemented something similar: Do you primarily use RSS or scraping for news/blog updates? How do you handle structural changes or failures? Are there specific tools or strategies you'd recommend?
5 Answers
You can't rely on long-term stability if the site doesn't offer either an API or a decent RSS feed. If they're not available, you're in for a rough ride!
If you want to dive deep into learning, building a simple web scraper could be a great project. For robust solutions, check out Scrapy; it's pretty powerful and well-suited for this kind of task.
I recommend using RSS or APIs first and falling back on scraping only when necessary. Keep a site-specific adapter for each source, with its own selectors, and set up alerts for when extraction fails. If a source breaks frequently, consider paying for access instead, since 'free' scraping can become costly in maintenance time.
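To make the adapter-plus-alert idea concrete, here's a rough sketch using only the Python standard library. Everything in it is made up for illustration: `SiteAdapter`, `run_adapter`, the field names, and the `alert` callback are hypothetical, and the regex patterns are just a stand-in for real CSS selectors (in practice you'd use Cheerio, Scrapy selectors, or similar).

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class SiteAdapter:
    """Per-site extraction rules (hypothetical; regexes stand in for CSS selectors)."""
    name: str
    patterns: Dict[str, str]  # field name -> regex with exactly one capture group
    required: List[str] = field(default_factory=lambda: ["title", "url"])

    def extract(self, html: str) -> Dict[str, Optional[str]]:
        """Apply every pattern; a field is None when its pattern does not match."""
        record: Dict[str, Optional[str]] = {}
        for key, pattern in self.patterns.items():
            match = re.search(pattern, html, re.DOTALL)
            record[key] = match.group(1).strip() if match else None
        return record

    def missing_fields(self, record: Dict[str, Optional[str]]) -> List[str]:
        """Required fields that came back empty, usually a sign the layout changed."""
        return [key for key in self.required if not record.get(key)]


def run_adapter(adapter: SiteAdapter, html: str,
                alert: Callable[[str], None]) -> Optional[Dict[str, Optional[str]]]:
    """Extract one record; call alert() and return None if required fields are missing."""
    record = adapter.extract(html)
    missing = adapter.missing_fields(record)
    if missing:
        alert(f"{adapter.name}: extraction failed for {missing}")
        return None
    return record


if __name__ == "__main__":
    adapter = SiteAdapter(
        name="example-blog",
        patterns={
            "title": r"<h1[^>]*>(.*?)</h1>",
            "url": r'<link rel="canonical" href="([^"]+)"',
        },
    )
    page = '<link rel="canonical" href="https://example.com/a"><h1>Hello</h1>'
    print(run_adapter(adapter, page, alert=print))
```

The point is less the extraction itself than the check afterwards: an empty required field gets reported right away instead of silently producing blank articles. The `alert` callback could just as well send an email or post to a chat channel.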
A good order of preference is RSS (or another feed) first, then an API, with scraping only as a backup. Keep in mind how each layer affects long-term stability: RSS and APIs are the more dependable foundation, so start there.
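That ordering could be sketched roughly like this in Python. The names `fetch_with_fallback`, `canonical_url`, and `dedupe` are hypothetical, each fetcher is assumed to be a zero-argument function returning a list of dicts with a `"url"` key, and the URL normalisation is deliberately simple (it ignores tracking parameters, for example). The dedupe step is included because removing duplicates was one of the goals in the question.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple
from urllib.parse import urlsplit, urlunsplit

Article = Dict[str, str]
Fetcher = Callable[[], List[Article]]


def fetch_with_fallback(
    sources: Iterable[Tuple[str, Fetcher]],
) -> Tuple[Optional[str], List[Article]]:
    """Try each (label, fetcher) in order; return the first non-empty result."""
    for label, fetcher in sources:
        try:
            articles = fetcher()
        except Exception as exc:  # a broken source should not stop the run
            print(f"{label} failed: {exc}")
            continue
        if articles:
            return label, articles
    return None, []


def canonical_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def dedupe(articles: List[Article], seen: Set[str]) -> List[Article]:
    """Keep only articles whose canonical URL is not in `seen`; updates `seen`."""
    fresh = []
    for article in articles:
        key = canonical_url(article["url"])
        if key not in seen:
            seen.add(key)
            fresh.append(article)
    return fresh
```

You'd call it with something like `fetch_with_fallback([("rss", fetch_rss), ("api", fetch_api), ("scrape", scrape_site)])`, where those three functions are whatever you've written per source, and persist `seen` between daily runs (a file or small database) so yesterday's articles aren't picked up again.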
Definitely check for RSS feeds and sitemaps! Your browser's developer tools make them easy to find: a quick search of the page source for "rss", "atom", or "sitemap" usually turns them up. Plus, if you paste a page's source code into a language model, it can help identify the right elements to scrape, and you can automate that step to regenerate selectors when a site's structure changes. And don't forget about news.google.com; you might find useful RSS feeds there too! Just be cautious about scraping content behind paywalls.
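If you'd rather automate the feed hunt than do it by hand, something along these lines might work. It's a sketch using only the Python standard library: it looks for `<link rel="alternate">` tags with an RSS or Atom type in a page's HTML, and for `Sitemap:` lines in a robots.txt file. The names `FeedLinkParser`, `discover_feeds`, and `sitemaps_from_robots` are my own, and fetching the HTML or robots.txt is left out.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects feed URLs advertised via <link rel="alternate" type="..."> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.feeds: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel and is_feed and attr.get("href"):
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def discover_feeds(html: str, base_url: str) -> List[str]:
    """Return absolute URLs of all RSS/Atom feeds linked from the page."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Return the URLs listed on 'Sitemap:' lines of a robots.txt body."""
    return [
        line.split(":", 1)[1].strip()
        for line in robots_txt.splitlines()
        if line.strip().lower().startswith("sitemap:")
    ]
```

Not every site advertises its feed this way, so an empty result doesn't prove there's no feed; it's still worth trying common paths by hand.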

It's so frustrating! They ditched our RSS feeds to push us into paid APIs and monopolize access. It feels like they are locking down the web instead of liberating information!