I'm looking to build a web scraper to extract product data such as name, price, description, and availability from dental supply sites like henryschein.com. So far, I have experimented with tools like Apify using Puppeteer and Playwright along with BrightData proxies to sidestep bot detection. Despite my efforts, I've encountered several challenges, including errors like `net::ERR_HTTP2_PROTOCOL_ERROR`, issues with waiting for elements to load, and variability in how pages render depending on the setup.

I want to create a robust scraper that can:
- Access product listings pages
- Extract product details into a structured format (like JSON or Google Sheets)
- Handle pagination as necessary
I would appreciate any selector examples, advice on utilizing Puppeteer/Cheerio with BrightData, and insight into whether using Apify is excessive or if a simpler solution could suffice. If needed, I can provide a sample page or HTML snapshot for clarity.
1 Answer
Have you thought about trying a basic Python script? How many products are you aiming to scrape from each site: 100, 1,000, or 10,000? Also, are you starting with just one site, or are you tackling multiple? Just a heads-up: most scrapers need customization for each site, since every site structures its pages differently.
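To give an idea of what a "basic Python script" could look like for the static-HTML sites, here is a minimal sketch that uses only the standard library. The markup is an assumption on my part: I'm pretending each product is a `<div class="product">` with `<span>` children classed `name`, `price`, and `availability`. Real sites will use different tags and class names, so the selectors have to be adapted per site.

```python
import json
from html.parser import HTMLParser

# Hypothetical markup: each product is a <div class="product"> whose fields
# are <span> elements with class "name", "price", or "availability".
FIELDS = {"name", "price", "availability"}


class ProductParser(HTMLParser):
    """Collects one dict per product block found in the HTML."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._current = None  # dict for the product currently being filled
        self._field = None    # name of the field whose text is being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "product" in classes:
            # A new product block starts here.
            self._current = {}
            self.products.append(self._current)
        elif tag == "span" and self._current is not None:
            matched = FIELDS.intersection(classes)
            if matched:
                self._field = matched.pop()

    def handle_data(self, data):
        # Append text only while inside a recognised field span.
        if self._field is not None:
            previous = self._current.get(self._field, "")
            self._current[self._field] = previous + data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def extract_products(html):
    """Return a list of product dicts parsed from an HTML string."""
    parser = ProductParser()
    parser.feed(html)
    return parser.products


sample = """
<div class="product">
  <span class="name">Nitrile Gloves</span>
  <span class="price">$9.99</span>
  <span class="availability">In stock</span>
</div>
"""
print(json.dumps(extract_products(sample), indent=2))
```

This sketch does not track where a product `<div>` closes, so it only suits flat listing markup. For anything more nested, a dedicated HTML parsing library with CSS selectors is probably the easier route, and the dynamically loaded sites will still need a browser or their underlying data endpoint.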
I typically scrape between 500 and 5,000 products per vendor. I'm starting with about 3-4 sites but plan to scale up to 10-12 later. Each site's structure varies, with some using dynamic loading and others just being basic HTML. I'm looking into using Apify and n8n for better automation!
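At a few thousand products per vendor, pagination can stay a plain loop that is shared across sites, with only the page-fetching part written per site. The following is a rough sketch under that assumption. `scrape_all_pages`, `fetch_page`, and the `sku` key are hypothetical names for illustration, not part of Apify, n8n, or any other tool mentioned here.

```python
from typing import Callable, Dict, Iterator, List

Product = Dict[str, str]


def scrape_all_pages(
    fetch_page: Callable[[int], List[Product]],
    max_pages: int = 200,
) -> Iterator[Product]:
    """Yield products page by page until a page adds nothing new.

    fetch_page is a per-site function (hypothetical) that takes a 1-based
    page number and returns the products parsed from that page.
    """
    seen = set()
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:
            break  # an empty page means we ran past the last one
        new_items = 0
        for item in items:
            # Deduplicate on SKU if present, otherwise on the product name.
            key = item.get("sku") or item.get("name")
            if key in seen:
                continue
            seen.add(key)
            new_items += 1
            yield item
        if new_items == 0:
            break  # some sites repeat the last page instead of returning none


# Usage with a fake in-memory "site" standing in for real HTTP requests.
fake_site = {
    1: [{"sku": "A1", "name": "Gloves"}, {"sku": "A2", "name": "Masks"}],
    2: [{"sku": "A3", "name": "Bibs"}],
}
for product in scrape_all_pages(lambda page: fake_site.get(page, [])):
    print(product)
```

Keeping the fetch step as a separate function means the static-HTML sites and the dynamically loaded ones can share the same loop, and the `max_pages` cap guards against a site that never returns an empty page.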