I'm looking for some advice on using Python specifically for data gathering tasks in my job. What are the essential Python skills or libraries I should focus on to effectively gather data? I'd really appreciate any insights or tips you all have.
4 Answers
As a software engineer turned project manager, I've used Python to scrape websites for news relevant to my industry. I recommend starting with libraries like Beautiful Soup or Selenium. I personally prefer Beautiful Soup for its simplicity!
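To give a feel for the Beautiful Soup approach: here's a minimal sketch of pulling headlines out of a page. The HTML is a literal string so the example runs anywhere; in a real scraper you'd fetch it first (e.g. with `requests.get(url).text`), and the class names here are made up for illustration.

```python
from bs4 import BeautifulSoup

# In a real scraper this HTML would come from an HTTP request;
# a literal string keeps the sketch self-contained.
html = """
<ul class="news">
  <li><a href="/a">Industry report released</a></li>
  <li><a href="/b">New regulation announced</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
# CSS selector: every <a> inside the <ul class="news"> list
headlines = [a.get_text(strip=True) for a in soup.select("ul.news a")]
print(headlines)  # ['Industry report released', 'New regulation announced']
```

The stdlib `html.parser` backend avoids an extra dependency; `lxml` is a faster drop-in if you have it installed.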
The tools you use really depend on the data source. For simpler tasks, libraries like requests or httpx handle HTTP requests well. Pandas is fantastic for handling tabular data and even offers easy plotting options. If you're pulling data from databases, look at SQLAlchemy for queries. For scientific datasets too large to fit in memory, Dask can process them efficiently in parallel.
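A typical small pipeline with these tools is just "fetch, load into pandas, compute." Here's a sketch where a literal CSV string stands in for the HTTP response so it runs offline; in practice the text would come from something like `requests.get(url).text` against your actual endpoint.

```python
import io
import pandas as pd

# Placeholder for a downloaded CSV export; in real code this string
# would be the body of an HTTP response.
csv_text = "region,units,price\nnorth,10,2.5\nsouth,4,3.0\n"

# io.StringIO lets read_csv treat the in-memory string like a file
df = pd.read_csv(io.StringIO(csv_text))
df["revenue"] = df["units"] * df["price"]
print(df["revenue"].sum())  # 37.0
```

From here, `df.to_sql(...)` via SQLAlchemy is a common next step if the destination is a database.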
I've worked with Beautiful Soup 4 for parsing older, static websites. For tasks that involve more interaction, I switched to Selenium and Playwright. Nowadays, I tend to work through APIs whenever one is available, which can be much more straightforward.
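One thing that comes up constantly when working through APIs is pagination. A generic pattern is a loop that follows a "next page" token until it's exhausted. In this sketch `fetch_page` is an injected callable so the example runs without a live API; the `{"items": ..., "next": ...}` response shape is an assumption for illustration, and in real code `fetch_page` would wrap something like `requests.get(url, params={"page": page}).json()`.

```python
def fetch_all(fetch_page):
    """Collect items from a paginated API.

    fetch_page(page) is assumed to return a dict like
    {"items": [...], "next": <next page number or None>}.
    """
    items, page = [], 1
    while page is not None:
        data = fetch_page(page)
        items.extend(data["items"])
        page = data.get("next")  # None ends the loop
    return items

# Stub standing in for a two-page API response:
pages = {
    1: {"items": ["a", "b"], "next": 2},
    2: {"items": ["c"], "next": None},
}
print(fetch_all(pages.__getitem__))  # ['a', 'b', 'c']
```

Injecting the fetcher like this also makes the loop trivial to unit-test.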
I use Python to scrape business-critical data from a vendor's website. It's been almost two years, and Python handles the whole pipeline: scraping, cleaning, and loading the data into our database. Right now I'm using Playwright, which has been great!
In my previous job, we mainly used Beautiful Soup for straightforward web pages and Selenium for more complex, dynamic content. Recently, we've transitioned to Playwright for all new projects and even started updating older ones.

That's a good point! Selenium can sometimes do better at getting past bot detection. I had to switch a couple of my scripts back to it after the requests calls feeding Beautiful Soup started coming back with 403 errors.
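Before switching a script to a full browser, it's worth checking whether the 403 is just the server rejecting the default `python-requests/x.y` User-Agent. A sketch of sending browser-like headers on a `requests.Session` (the header values are examples, and this is no guarantee against real bot detection):

```python
import requests

# Some sites 403 the default python-requests User-Agent outright;
# browser-like headers are sometimes enough. These values are examples.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
})

# response = session.get("https://example.com/data")  # placeholder URL
print(session.headers["User-Agent"])
```

If that doesn't help (JavaScript challenges, fingerprinting), a real browser via Selenium or Playwright is the fallback.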