I'm developing a Python data processing pipeline with two distinct parts. The first involves typical tabular tasks such as joins, aggregations, and cumulative calculations. The second is more complex: it involves sequential, recursive operations where some values in the current row depend on results computed for the previous week's row, so it isn't a straightforward, vectorizable problem. I'm looking for architectural advice on efficient ways to handle this kind of workload, and I'm not assuming there's a single ideal solution. Can anyone suggest approaches or frameworks I should consider?
5 Answers
If your groups are constant, you can use `shift` in Polars for the second problem to pull the previous row's value into the current row. Otherwise, you might need to calculate an index and join the necessary values. Let me know what you think!
For the first part, I strongly recommend Polars; it’s built for performance. For part two, I’d avoid Python if possible. A language like Rust could offer significant performance advantages, and you can wrap the Rust code in a Python library for integration. It’s worth considering!
You might find `numba` useful for the second part of your pipeline. It allows you to write regular for loops that get compiled into machine code, which can really speed things up. But how you implement it will depend on your approach for part one. Check out the documentation for more insights!
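Here is a rough sketch of what that could look like for a recurrence where each week's result depends on the previous week's result. The function name `carry_forward`, the `rate` parameter, and the recurrence itself are assumptions made up for the example, not something from the question. The `try`/`except` only exists so the snippet still runs (slowly) without numba installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the example still runs without numba (uncompiled).
    def njit(func):
        return func


@njit
def carry_forward(inflow, rate, start):
    # Hypothetical recurrence: result[i] = result[i - 1] * rate + inflow[i]
    out = np.empty(inflow.shape[0], dtype=np.float64)
    prev = start
    for i in range(inflow.shape[0]):
        prev = prev * rate + inflow[i]
        out[i] = prev
    return out


weekly_inflow = np.array([100.0, 50.0, 0.0, 25.0])
balances = carry_forward(weekly_inflow, 0.5, 0.0)
```

One way to connect this to part one would be to do the joins and aggregations in a dataframe library, export the relevant column as a NumPy array per group, run the loop, and attach the result back as a new column.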
For part one, utilizing a dataframe library like Pandas or Polars should serve you well. As for the second part, look into using window functions—both Pandas and Polars have great support for that. Google BigQuery could be another option if it fits your workflow!
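As a small illustration of the window-function route in Pandas, assuming the same kind of weekly, grouped data (column names and values are invented for the example):

```python
import pandas as pd

# Hypothetical weekly data; column names and values are illustrative only.
df = pd.DataFrame(
    {
        "group": ["a", "a", "a", "b", "b"],
        "week": [1, 2, 3, 1, 2],
        "value": [10.0, 12.0, 15.0, 7.0, 9.0],
    }
).sort_values(["group", "week"])

grouped = df.groupby("group")["value"]

# Running total within each group.
df["cumulative"] = grouped.cumsum()

# Two-week rolling mean within each group; the first week of each
# group uses only itself because min_periods=1.
df["rolling_mean"] = grouped.transform(
    lambda s: s.rolling(window=2, min_periods=1).mean()
)
```

Window functions like these look back at existing columns, so they fit cumulative and lagged calculations well; a row that depends on the previous row's freshly computed output may still need an explicit loop.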
Have you considered using Polars? It's super fast for data processing tasks and could be a great fit, especially for handling large datasets.

Thanks for the suggestion! I know Rust has performance benefits, but I have to stick with Python for this project.