I've been using CSV files to store tabular data in my projects, but a new project involves thousands of tables with about a million rows each, so I need something more efficient. Most of the time I only need around ten specific rows from a table, but with CSVs I have to read the entire file, which is really slow. I've switched from Pandas to Polars for faster data retrieval, but I still face the same inefficiency because I have to load the whole file. Parquet is faster, but it also requires reading the full file to get specific rows. SQLite allows specific row queries but is slower than reading a CSV with Pandas.

What I'm looking for is a storage format that lets me retrieve data quickly (at least as fast as Parquet with Polars) and supports random access via SQL-like queries without reading the entire table. My data is numerical with about ten columns, and the first column is a primary key. I'm not worried about storage space, but ideally I want an easy-to-use API similar to Pandas or Polars, although I'm open to more complex solutions if they meet my performance needs.
5 Answers
Have you looked into using Polars' lazy API with Parquet? With a lazy scan, your filters are pushed down so Polars only reads the row groups that can match, which could significantly speed up your data access without scanning the whole file.
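A minimal sketch (the file and column names are placeholders, adjust to your schema):

```python
import polars as pl

wanted = [17, 42, 99]  # hypothetical primary-key values

rows = (
    pl.scan_parquet("table.parquet")          # builds a lazy query, nothing is read yet
    .filter(pl.col("id").is_in(wanted))       # filter is pushed down to the scan
    .collect()                                # only row groups that can match are decoded
)
print(rows)
```

How much gets skipped depends on how the row groups line up with your keys; if each file is sorted by its primary key, the row-group statistics can prune most of the file.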
Absolutely, using lazy loading could trim down your loading times.
If performance is your top priority and you want efficient random reads, consider a key-value store like LMDB for very fast lookups by key. Just remember that if you need queries beyond primary-key lookups, SQLite could still be an option, but it'll be slower.
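A rough sketch with the py-lmdb bindings, storing one serialized row per primary key (the file name, map size, and row values are placeholders):

```python
import lmdb
import pickle

# Open (or create) an LMDB environment; map_size is an upper bound on file size.
env = lmdb.open("rows.lmdb", map_size=2**30)

with env.begin(write=True) as txn:
    # Store a hypothetical numeric row under its primary key.
    txn.put((42).to_bytes(8, "big"), pickle.dumps([0.1, 0.2, 0.3]))

with env.begin() as txn:
    # A lookup touches only the pages for this key, not the whole file.
    row = pickle.loads(txn.get((42).to_bytes(8, "big")))
print(row)
```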
LMDB does seem quite fast! But some limitations in its Python bindings make me slightly hesitant.
You might also want to look into Apache Arrow with the Feather format. Feather files can be memory-mapped for fast reads and integrate nicely with both Polars and Pandas, so this could be a good fit for your requirements.
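A small sketch with pyarrow (file and column names are made up); Polars can read the same file with pl.read_ipc:

```python
import pyarrow as pa
import pyarrow.feather as feather

# Write a Feather (Arrow IPC) file once; uncompressed files can be memory-mapped directly.
table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
feather.write_feather(table, "table.feather", compression="uncompressed")

# Read it back memory-mapped, so pages are only pulled from disk as they are touched.
subset = feather.read_table("table.feather", columns=["id", "value"], memory_map=True)
print(subset.to_pandas())
```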
That sounds promising—I'll do some research on Feather!
Feather might just be the best fit for what you're trying to achieve, considering your use of Polars.
Just a note: no format can entirely avoid scanning the data if a query can't be answered from an index, so complex searches will always cost something. But if you're set on improving access times, you could complement your CSV files with a database like SQLite, or explore DuckDB alongside it, and pick whichever matches your usage patterns.
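For the SQLite side of that hybrid, a hedged sketch: import each CSV once, index the primary key, and run the ten-row lookups against the index (file, table, and column names are placeholders):

```python
import sqlite3
import pandas as pd

con = sqlite3.connect("tables.db")

# One-time import of a CSV into a table, then index the primary-key column.
pd.read_csv("table1.csv").to_sql("table1", con, if_exists="replace", index=False)
con.execute("CREATE INDEX IF NOT EXISTS idx_table1_id ON table1(id)")
con.commit()

# Later lookups hit the index instead of scanning the whole table.
rows = con.execute("SELECT * FROM table1 WHERE id IN (?, ?, ?)", (17, 42, 99)).fetchall()
print(rows)
con.close()
```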
That makes sense. I guess having a hybrid solution could work well in balancing both quick access and flexibility.
Good point! A combination could offer the best of both worlds.
Have you tried using DuckDB? It's designed for analytics and works like SQLite but is faster for these kinds of operations. It supports both in-memory and on-disk processing and has a SQL syntax similar to Postgres. Plus, there's a Python package for easy integration with your projects!
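A quick sketch of how that could look (the file and column names are assumptions); DuckDB can query Parquet or CSV files in place:

```python
import duckdb

con = duckdb.connect()  # in-memory; pass a path for an on-disk database

# Pull a handful of rows by key straight from a Parquet file on disk.
result = con.execute(
    "SELECT * FROM read_parquet('table.parquet') WHERE id IN (17, 42, 99)"
).pl()  # .pl() returns a Polars DataFrame; .df() returns pandas
print(result)
```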
Yeah, DuckDB seems to be a solid option for your needs. It should handle the full scans quicker than SQLite and still allow you to access specific rows efficiently.
Thanks for the suggestion! I'll definitely check out DuckDB.
That might help too! I should also explore how Polars LazyFrames handle this.