I'm working with an S3 bucket where various clients will upload parquet files related to different topics like userdata, revenue data, and marketing data. I'm torn between two naming conventions for the bucket structure. Should I organize it by client first, like this: bucket/client1/userdata, bucket/client2/userdata, bucket/client1/revenuedata? Or would it be better to organize it by topic first, such as bucket/userdata/client1, bucket/userdata/client2? The topics are generally similar but differ in schema (some have more fields than others). We're planning to ingest this data into Databricks every day, and I'd love to hear your thoughts on the best approach!
5 Answers
I have a similar setup: an ingress facade that routes data through a bucket and a queue before it reaches the consumer. S3 object-created event notifications drive the whole flow, and they handle events reliably if that pattern aligns with your needs.
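If it's useful, here's a minimal sketch of that wiring with boto3. The bucket name and queue ARN are hypothetical, and the SQS queue also needs its own resource policy allowing S3 to publish to it (not shown):

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical names; substitute your own bucket and queue ARN.
BUCKET = "client-ingest-bucket"
QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:ingest-events"

# Fire an event into SQS whenever a parquet object lands, under any prefix.
s3.put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": QUEUE_ARN,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [{"Name": "suffix", "Value": ".parquet"}]
                    }
                },
            }
        ]
    },
)
```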
If you're looking at long-term storage with a substantial amount of data, I recommend giving each client its own bucket. It simplifies cost allocation, since AWS cost allocation tags are applied at the bucket level, and it keeps access management and auditing clean because client data is never mixed.
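As a rough sketch of how that looks with boto3 (the bucket names and the `client` tag key are made up, and `create_bucket` as written assumes us-east-1; other regions need a `CreateBucketConfiguration`):

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical clients; one bucket per client keeps costs separable.
for client in ["client1", "client2"]:
    bucket = f"acme-data-{client}"  # hypothetical naming scheme
    s3.create_bucket(Bucket=bucket)  # assumes us-east-1
    # Cost allocation tags are applied per bucket, so each client's storage
    # and request charges show up as their own line in Cost Explorer.
    s3.put_bucket_tagging(
        Bucket=bucket,
        Tagging={"TagSet": [{"Key": "client", "Value": client}]},
    )
```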
I completely agree—having separate buckets helps keep everything organized and compliant.
Don't forget your exit strategy: think about how you'll prove data deletion when a client leaves. Be clear about whether you're holding client business data or end-user personal data (the latter can carry GDPR-style deletion obligations) and manage each accordingly. Lifecycle rules can be your friend here to automate some of that.
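For example, here's a minimal lifecycle-rule sketch (bucket name, prefix, and retention window are all hypothetical) that expires one client's raw uploads automatically:

```python
import boto3

s3 = boto3.client("s3")

# Expire everything under client1/ after 90 days (hypothetical policy),
# giving you a mechanical answer to "when was this data deleted?".
s3.put_bucket_lifecycle_configuration(
    Bucket="client-ingest-bucket",  # hypothetical name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-client1-raw",
                "Filter": {"Prefix": "client1/"},
                "Status": "Enabled",
                "Expiration": {"Days": 90},
            }
        ]
    },
)
```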
Consider the technical aspects, particularly permissions. If you put the client identifier first, writing access policies gets simpler: one prefix-scoped statement per client covers all of that client's topics. Also remember that S3 has a flat namespace; "folders" are just key prefixes, and listing is scoped by prefix. So if Databricks processes all userdata in one job, a topic-first layout lets it list everything under a single prefix. If you process one client at a time, though, a client-first approach works just as well.
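To illustrate the policy point, here's a sketch of a prefix-scoped IAM policy built as a Python dict (bucket and client names are hypothetical); with a client-first layout, this one template covers every topic the client uploads:

```python
import json

CLIENT = "client1"
BUCKET = "client-ingest-bucket"  # hypothetical name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Read/write only under this client's prefix.
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/{CLIENT}/*",
        },
        {
            # Listing is a bucket-level action, so scope it
            # with the s3:prefix condition key.
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{BUCKET}",
            "Condition": {"StringLike": {"s3:prefix": f"{CLIENT}/*"}},
        },
    ],
}
print(json.dumps(policy, indent=2))
```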
Good point! We'll process by client, as these are essentially sub-branches of our main company.
If clients are writing directly to this bucket, a per-client prefix is definitely easier to manage. However, I strongly advise against giving them direct access to your main Databricks bucket. Better to set up a separate ingestion bucket where clients drop data, and you control how it moves to your processing bucket; that way clients can't corrupt your curated data or formats. If a single Databricks consumer handles data from all clients, lean toward the single-bucket model: it cuts the management overhead and per-client schema configuration every time you onboard a new client.
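To show what the single-consumer model can look like in Databricks, here's a hedged Auto Loader sketch (the paths, table name, and topic-first layout are assumptions, not your actual setup). One stream reads every client's userdata, recovers the client id from the object key, and lets schema evolution absorb the per-client field differences you mentioned:

```python
# Databricks notebook sketch (PySpark). Assumes a layout like
# s3://processing-bucket/userdata/<client>/...  -- all names hypothetical.
from pyspark.sql.functions import input_file_name, regexp_extract

df = (
    spark.readStream.format("cloudFiles")  # Auto Loader
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", "s3://processing-bucket/_schemas/userdata")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")  # tolerate extra fields
    .load("s3://processing-bucket/userdata/*/")
    # Recover the client id from the object key so one stream serves all clients.
    .withColumn("client", regexp_extract(input_file_name(), r"userdata/([^/]+)/", 1))
)

(
    df.writeStream
    .option("checkpointLocation", "s3://processing-bucket/_checkpoints/userdata")
    .trigger(availableNow=True)  # run as a daily batch-style job
    .toTable("raw_userdata")
)
```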
Definitely! Since they're branches of our company, it makes sense to give them direct access to an ingest bucket.

Exactly! And note that an S3 bucket has no object-count or storage limit, so a single ingest bucket won't "overflow"; the only quota to watch is the per-account bucket count if you ever move to one bucket per client. This approach will streamline your data management.