I often find myself dealing with messy data, especially third-party exports and API responses. The issues I run into most are CSV files with inconsistent column names and nested JSON that needs to be flattened before it's usable. Lately I've been reaching for small Python scripts instead of spreadsheets or heavier tools, because they're faster to write and easier to automate. I'm curious how others handle this. Do you clean data manually, build out pandas workflows, use ETL tools, or write your own small utilities? I'd love to hear about your approaches in real projects!
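To make that concrete, here's a minimal sketch of the kind of script I mean (the file names export.csv and response.json are placeholders, and the JSON is assumed to be a list of objects), using only the standard library:

```python
import csv
import json

def normalize_header(name: str) -> str:
    """Lowercase, strip, and snake_case a column name."""
    return name.strip().lower().replace(" ", "_").replace("-", "_")

def flatten(obj: dict, parent_key: str = "", sep: str = ".") -> dict:
    """Flatten nested dicts into a single level with dotted keys."""
    items = {}
    for key, value in obj.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep=sep))
        else:
            items[new_key] = value
    return items

# Normalize inconsistent CSV headers (export.csv is hypothetical)
with open("export.csv", newline="") as f:
    rows = [
        {normalize_header(k): v for k, v in row.items()}
        for row in csv.DictReader(f)
    ]

# Flatten a nested API response (response.json is hypothetical)
with open("response.json") as f:
    records = [flatten(r) for r in json.load(f)]
```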
5 Answers
I've mostly relied on heavy pandas workflows. I focus on normalizing JSON and renaming columns to match the database schema. I also set alerts in the ETL pipeline so that when third-party column names change, I find out before it breaks my workflow.
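A minimal sketch of that pattern (the payload and rename_map here are made up for illustration):

```python
import pandas as pd

# Hypothetical third-party API payload with nested fields
payload = [
    {"id": 1, "user": {"First Name": "Ada", "email": "ada@example.com"}},
    {"id": 2, "user": {"First Name": "Lin", "email": "lin@example.com"}},
]

# Flatten nested JSON; nested keys become dotted column names
df = pd.json_normalize(payload)

# Map vendor column names onto the database schema
rename_map = {"user.First Name": "first_name", "user.email": "email"}

# Cheap alert: fail loudly if the vendor renamed a column upstream
missing = set(rename_map) - set(df.columns)
if missing:
    raise ValueError(f"Schema change upstream, missing columns: {missing}")

df = df.rename(columns=rename_map)
```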
For raw JSON and CSV, I archive the files as-is in S3 or another file store first. Then I write small Python scripts to load them into my data warehouse, usually flattening the JSON into columns so it's easier to query.
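Roughly like this sketch (the bucket and key are hypothetical, and SQLite stands in for the warehouse):

```python
import json

import boto3
import pandas as pd
from sqlalchemy import create_engine

# Fetch the raw export archived in S3 (bucket/key are hypothetical)
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="raw-exports", Key="vendor/2024-06-01.json")
records = json.loads(obj["Body"].read())

# Flatten nested JSON so each field becomes its own column
df = pd.json_normalize(records)

# Append into the warehouse (SQLite here as a stand-in)
engine = create_engine("sqlite:///warehouse.db")
df.to_sql("vendor_export", engine, if_exists="append", index=False)
```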
If the data coming from other departments is messy, I just ask them to sort out their exports. 😏☕
I prefer sticking with plain Python for data processing. I use jsonschema for validation, which catches issues early. Simple transformations like renaming columns or adjusting data types are easy to keep in plain Python, and I document the expected structure in something like a Google Sheet to share with the team.
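For example, a small validation sketch with jsonschema (the schema and record are invented for illustration):

```python
from jsonschema import ValidationError, validate

# Hypothetical schema documenting the expected export structure
schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "email": {"type": "string"},
        "signup_date": {"type": "string"},
    },
    "required": ["id", "email"],
}

record = {"id": "42", "email": "ada@example.com"}  # id has the wrong type

try:
    validate(instance=record, schema=schema)
except ValidationError as e:
    # Surfaces the bad field and its path early, before it hits the pipeline
    print(f"Validation failed: {e.message} at {list(e.absolute_path)}")
```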
I've been using DuckDB lately; it's incredibly fast! I also leverage an LLM to help me write Python or SQL code to clean the data. It's been a game changer!
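As a taste of that workflow, here's a sketch of cleaning a CSV in plain SQL via DuckDB's Python API (messy.csv and its column names are hypothetical):

```python
import duckdb

con = duckdb.connect()

# read_csv_auto infers types from the file; the SELECT renames the
# messy vendor headers and coerces types, all in one SQL statement
df = con.execute("""
    SELECT
        "Customer ID"::INTEGER          AS customer_id,
        trim("E-Mail")                  AS email,
        try_cast("Signup Date" AS DATE) AS signup_date
    FROM read_csv_auto('messy.csv')
""").df()
```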
