Hi everyone! I'm a data analyst with no formal software development training, and I've been working with code from colleagues. I've noticed many examples where they store large DataFrames as class attributes, like this:
```python
import pandas as pd


class MyClass:
    def __init__(self, df: pd.DataFrame):  # other parameters omitted
        self.df = df
        # Other parameters initialized here

    def do_something_using_df(self) -> float:
        pass
```
I initially didn't think much of it, but I've come to realize that these DataFrames can consume a lot of memory (think millions of rows and hundreds of columns). Creating instances of such classes appears to duplicate the DataFrame, or at least keep it alive, which can add up to several GB of memory usage, since these objects often remain referenced and are never garbage collected.
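Whether an instance really duplicates the data depends on how the DataFrame gets there: plain attribute assignment in Python binds a reference to the existing object, while an explicit `.copy()` allocates new memory. Here is a minimal sketch for checking this. `Holder` is a hypothetical stand-in, not a class from the code in question.

```python
import pandas as pd


class Holder:
    """Hypothetical stand-in for a class that keeps a DataFrame as an attribute."""

    def __init__(self, df: pd.DataFrame):
        self.df = df  # binds a reference; no data is copied here


df = pd.DataFrame({"a": range(1_000), "b": range(1_000)})
first = Holder(df)
second = Holder(df)

# Both instances point at the very same DataFrame object.
same_object = first.df is df and second.df is df

# An explicit .copy() is what allocates a second, independent DataFrame.
copied = Holder(df.copy())
is_separate = copied.df is not df

print(same_object, is_separate)
```

If the check shows shared references, the memory cost comes less from duplication and more from the instances keeping the DataFrame alive for as long as they themselves are referenced.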
So, I'm wondering: is it bad practice to store large DataFrames inside class attributes? I couldn't find clear guidance on this, especially since methods like `do_something_using_df()` typically perform one specific calculation. I think it's fine for smaller DataFrames with just a couple of columns; the real issue seems to be that my colleagues routinely put massive DataFrames into classes without any cleanup. Would a different approach, perhaps separating data handling from processing, better respect the single responsibility principle? I'd love any perspectives on the pros and cons of storing DataFrames as class attributes, not just in Python but across programming languages!
1 Answer
Yes, storing large DataFrames in class attributes can be considered bad practice, especially if it leads to unnecessary duplication and increased memory usage. Pandas DataFrames are already objects, so there’s typically no need to wrap them in another object unless you truly need to extend their functionality. Instead, consider using the "pipe-and-filter" pattern, where each function takes a DataFrame as input and returns the processed DataFrame. This keeps your design cleaner and avoids the memory issues that come with many long-lived instances each holding on to a DataFrame.
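A minimal sketch of the pipe-and-filter idea using `DataFrame.pipe`, which passes the DataFrame as the first argument to a function. The step functions and column names here are hypothetical and only for illustration:

```python
import pandas as pd


def drop_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Filter step: remove rows that contain any missing value."""
    return df.dropna()


def add_total(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Filter step: add a 'total' column summing the given columns."""
    return df.assign(total=df[cols].sum(axis=1))


def mean_total(df: pd.DataFrame) -> float:
    """Terminal step: reduce the DataFrame to a single number."""
    return float(df["total"].mean())


raw = pd.DataFrame({"x": [1.0, 2.0, None], "y": [10.0, 20.0, 30.0]})

# Each step receives a DataFrame and hands its result to the next step.
result = (
    raw.pipe(drop_missing)
       .pipe(add_total, cols=["x", "y"])
       .pipe(mean_total)
)
```

Nothing holds on to the intermediate DataFrames here, so they become eligible for garbage collection as soon as the next step has produced its result, and `raw` itself is left unmodified.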
That sounds interesting! I looked up the pipe-and-filter pattern, and it seems that could be really useful for our needs. Would that approach only work with functions, or could it be applied to classes as well? In our case, we often rely on inheritance for our data operations.
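One possible way to combine the pattern with classes and inheritance, offered as a sketch rather than as what the answer above prescribes: each class holds only its configuration and receives the DataFrame as a call argument instead of storing it. `Step`, `SelectColumns`, and `Scale` are hypothetical names.

```python
from abc import ABC, abstractmethod

import pandas as pd


class Step(ABC):
    """Hypothetical base class: holds configuration only, never the DataFrame."""

    @abstractmethod
    def __call__(self, df: pd.DataFrame) -> pd.DataFrame:
        ...


class SelectColumns(Step):
    def __init__(self, cols: list[str]):
        self.cols = cols

    def __call__(self, df: pd.DataFrame) -> pd.DataFrame:
        return df[self.cols]


class Scale(Step):
    def __init__(self, factor: float):
        self.factor = factor

    def __call__(self, df: pd.DataFrame) -> pd.DataFrame:
        return df * self.factor


steps = [SelectColumns(["x"]), Scale(2.0)]
out = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
for step in steps:
    # Instances are callable, so they can be passed to pipe like functions.
    out = out.pipe(step)
```

Because the step objects never keep a reference to the data, they stay small and reusable across different DataFrames.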