Hey everyone! I'm reaching out for some help with a project I'm working on for school. I'm a beginner in both Python and web scraping, and I'm feeling a bit overwhelmed. I need to gather data from 50 grocery store websites, specifically looking for the names, emails, and phone numbers for the general manager and the head of recruitment. I want to compile all of this information into an Excel sheet. Here's what I'm thinking:
1. First, I'll create a list of all 50 store URLs. I might use ChatGPT to help verify if the URLs are correct.
2. Next, I'll need a script that can crawl each store's website to find the relevant info. The problem is, I'm not sure how to navigate through sites if the data is under subpages or different labels.
For example, the general manager might be listed as "GM," "General Manager," or something else entirely, and I don't know how to tackle that variation. I'd really appreciate any guidance on how to write this script or tips on where to start. Thanks a lot!
2 Answers
I've done quite a bit of web scraping, and I can tell you that grocery store sites might not be easy to scrape. Many of them have protections in place that could block simple requests. Just getting the HTML is a real challenge, and you might run into issues very quickly. This project could be way too advanced for a beginner. If you provide more details, I might be able to assist further!
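For the pages that do return HTML, the subpage problem from the question can be narrowed down by collecting only same-site links whose URL or link text hints at contact or staff information. The following is a minimal sketch using only the standard library. The keyword list, the `LinkCollector` class, and the `candidate_pages` function are all hypothetical names and guesses for illustration, not something taken from any particular store's site, and it does nothing to get past blocking.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: guessed keywords; adjust after inspecting a few real sites.
KEYWORDS = ("contact", "about", "team", "staff", "career", "job")


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> tag.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def candidate_pages(html, base_url):
    """Return same-site URLs whose href or link text contains a keyword."""
    parser = LinkCollector()
    parser.feed(html)
    site = urlparse(base_url).netloc
    found = []
    for href, text in parser.links:
        url = urljoin(base_url, href)  # resolve relative links
        if urlparse(url).netloc != site:
            continue  # skip external links and mailto: style links
        haystack = (href + " " + text).lower()
        if any(k in haystack for k in KEYWORDS) and url not in found:
            found.append(url)
    return found
```

Each returned URL would still have to be fetched and checked, so this only reduces the number of pages to look at; it does not guarantee the contact details are on any of them.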
Honestly, you might find it easier to just do this manually rather than trying to write a scraper, especially since there are only 50 stores to deal with. You have to account for titles like "General Manager" and "Head of Recruitment" being worded differently from site to site. Automating this isn't as straightforward as it seems, and you might not even find all the info you're looking for. Collecting the data by hand might be a better option here!
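If you do try to automate it, the title variation can be handled with one regular expression per role that lists the wordings you expect, then searching the text that follows a match for an email address and phone number. This is only a rough sketch: the variant lists, the US-style phone pattern, the 120-character window, and the `extract_contacts` function are all assumptions made for illustration, and real pages will need their own tuning.

```python
import re

# Assumption: guessed label variants per role; extend as new wordings turn up.
ROLE_PATTERNS = {
    "general_manager": re.compile(r"\b(?:general\s+manager|GM)\b", re.I),
    "head_of_recruitment": re.compile(
        r"\b(?:head\s+of\s+recruit(?:ment|ing)"
        r"|recruit(?:ment|ing)\s+manager"
        r"|talent\s+acquisition)\b",
        re.I,
    ),
}

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumption: 10-digit US-style numbers such as (555) 010-2030 or 555-010-2030.
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def extract_contacts(text, window=120):
    """Find the first mention of each role and the contact details after it."""
    rows = []
    for role, pattern in ROLE_PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            continue
        # Look only at the text from the label up to `window` characters past it.
        nearby = text[match.start(): match.end() + window]
        email = EMAIL_RE.search(nearby)
        phone = PHONE_RE.search(nearby)
        rows.append({
            "role": role,
            "label_found": match.group(0),
            "email": email.group(0) if email else "",
            "phone": phone.group(0) if phone else "",
        })
    return rows
```

The resulting rows can be written out with the standard `csv.DictWriter`, which produces a file Excel can open. Expect empty fields and wrong matches (a short label like "GM" can appear in unrelated text), so a manual check of each row is still needed.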