Is it common for one Python package to overwrite files of another installed package?

Asked By CuriousCoder42 On

Hey everyone, I ran into something unusual with package installations and I'm hoping you can shed some light on it. I was using PySpark 3.5.5 without any issues. However, after I upgraded MLflow from 2.x to 3.x (specifically with the databricks extra), PySpark started throwing errors that looked like Spark 4 behavior.

To troubleshoot, I created a clean virtual environment and installed PySpark 3.5.5. At that point, site-packages contained only the files PySpark ships. But when I then installed the Databricks Connect library, which is a transitive dependency of MLflow, I watched it modify PySpark's files directly in my site-packages directory. Instead of hooking in at runtime or extending functionality, it was literally overwriting PySpark's own code.
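For anyone who wants to reproduce this, here's a rough sketch of the check I mean (the script name is just illustrative): checksum every file under the installed package, run it once before and once after installing databricks-connect, and diff the two dumps.

```python
# snapshot_pkg.py -- hash every file under site-packages/<package>.
# Run once before and once after the install, then diff the two JSON files:
#   python snapshot_pkg.py pyspark > before.json
#   pip install databricks-connect
#   python snapshot_pkg.py pyspark > after.json
import hashlib
import json
import sys
import sysconfig
from pathlib import Path

def snapshot(package: str) -> dict[str, str]:
    """Map each file under site-packages/<package> to its SHA-256 digest."""
    root = Path(sysconfig.get_paths()["purelib"]) / package
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

if __name__ == "__main__":
    json.dump(snapshot(sys.argv[1]), sys.stdout, indent=2, sort_keys=True)
```

Any filename whose digest changes between the two dumps was rewritten by the second install.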

I had assumed packages typically either monkey-patch at runtime or ship a separate extension layer rather than overwriting another package's files. Is this behavior standard practice in the Python community, or am I right to be surprised by it? (A toy example of the runtime patching I expected is below.)
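```python
# Toy illustration of monkey-patching: rebind an attribute on the imported
# module object at runtime. Nothing under site-packages/pyspark changes on
# disk, and the patch disappears in a fresh interpreter. The attribute
# patched here is arbitrary, purely for illustration.
import pyspark

_original_version = pyspark.__version__
pyspark.__version__ = _original_version + "+patched-at-runtime"
print(pyspark.__version__)
```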

5 Answers

Answered By SafetyNerd23 On

This seems to be a known issue; there are public discussions in which Databricks employees acknowledge the behavior. It's concerning, because overwriting another package's files leaves pip's metadata saying one thing while the code on disk says another, so uninstalls, upgrades, and reproducible environments all become unreliable.

Answered By TechieTommy On

Definitely not normal behavior! This should be reported as a bug against the Databricks library; there's no good reason for one package to overwrite another package's files like that.

Answered By CodeChallenger99 On

Are you sure databricks-connect doesn't simply require a newer version of PySpark? pip will upgrade or replace dependencies automatically during an install, and that can look like files being clobbered.
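One way to rule that out, assuming databricks-connect is installed in the active environment (`pip show databricks-connect` gives the same information):

```python
# Print the requirements databricks-connect declares, so you can see whether
# it pins a pyspark version that would force pip to upgrade or replace it.
from importlib.metadata import requires

for req in requires("databricks-connect") or []:
    print(req)
```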

Answered By LibertyLine15 On

That’s pretty sketchy behavior, but it’s somewhat expected from proprietary SDKs; they often make these magical modifications so things work seamlessly out of the box. Still, overwriting files in place instead of shipping them separately is generally bad practice.

Answered By DevDude88 On

If you check the package metadata for databricks-connect, you'll find it declares that it provides and obsoletes PySpark, meaning it’s intended to replace PySpark rather than coexist with it. That's problematic and certainly raises red flags.
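You can verify the claim yourself in a couple of lines, assuming the package is installed in the current environment:

```python
# Read databricks-connect's core metadata and print the fields that declare
# it as a stand-in for / replacement of another distribution.
from importlib.metadata import metadata

meta = metadata("databricks-connect")
print("Provides-Dist: ", meta.get_all("Provides-Dist"))
print("Obsoletes-Dist:", meta.get_all("Obsoletes-Dist"))
```

If either field lists pyspark, the package is explicitly telling installers it stands in for PySpark.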
