I have an S3 bucket where different clients will be dropping parquet files for various topics like userdata, revenue data, and marketing data. I'm trying to decide on a key-prefix convention inside the bucket. Should I go with a structure that puts the client first, like:
* bucket/client1/userdata
* bucket/client2/userdata
* bucket/client1/revenuedata
Or, should it be structured to prioritize the topic, like:
* bucket/userdata/client1
* bucket/userdata/client2
My main concern is about the long-term management of these files since the schemas of the topics can differ, with some files having extra fields while others lack some. We plan on ingesting this data into Databricks daily.
5 Answers
Don't forget to consider your exit strategy. You'll need to prove that data has been deleted at some point. Think about whether the data is user-specific or client-specific. Setting lifecycle rules can really help with that.
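A minimal sketch of what such lifecycle rules could look like with a client-first layout. The client IDs, retention periods, and the bucket name in the comment are hypothetical; the dict follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts.

```python
def lifecycle_rule_for_client(client_id: str, retention_days: int) -> dict:
    """Build one lifecycle rule that expires objects under a client's prefix."""
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    return {
        "ID": f"expire-{client_id}",
        "Filter": {"Prefix": f"{client_id}/"},  # assumes keys start with the client ID
        "Status": "Enabled",
        "Expiration": {"Days": retention_days},
    }


def lifecycle_configuration(retention_by_client: dict) -> dict:
    """Combine per-client rules into one lifecycle configuration."""
    rules = [
        lifecycle_rule_for_client(client_id, days)
        for client_id, days in sorted(retention_by_client.items())
    ]
    return {"Rules": rules}


# Hypothetical retention periods, in days.
config = lifecycle_configuration({"client1": 365, "client2": 90})

# Applying it needs boto3 and credentials, so it is left as a comment:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-ingest-bucket", LifecycleConfiguration=config)
```

A lifecycle filter matches from the start of the key, so one rule per client is enough when the client comes first. With a topic-first layout you would need one rule per topic and client combination. On a versioned bucket, noncurrent versions need their own expiration rule as well.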
If you're planning for long-term storage, a dedicated bucket per client is a solid choice. It simplifies cost tracking, because S3 cost allocation tags apply to buckets rather than to prefixes. Bucket quotas are unlikely to be a problem either: the default is 10,000 buckets per account, and AWS can raise it on request, up to 1 million.
I totally agree, it makes everything easier for access and auditing.
I have a similar setup. Data flows ingress facade -> bucket -> queue -> consumer, and I use S3's object-created notifications to fan the events out to the consumers.
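For illustration, here is a rough sketch of the consumer side of that flow: it takes an S3 event notification payload and pulls out the client and topic from each new object's key. The `client/topic/file` key layout, the bucket name, and the function name are assumptions for the example; if the notification arrives through SQS, the payload would first need to be parsed from the message body.

```python
from urllib.parse import unquote_plus


def route_records(event: dict) -> list:
    """Extract bucket, key, client, and topic from object-created records."""
    routed = []
    for record in event.get("Records", []):
        # Skip anything that is not an object-created event.
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys are URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        parts = key.split("/")
        if len(parts) < 3:
            continue  # does not follow the assumed client/topic/file layout
        routed.append(
            {"bucket": bucket, "key": key, "client": parts[0], "topic": parts[1]}
        )
    return routed


# Hypothetical payload with two uploads and one delete.
sample_event = {
    "Records": [
        {"eventName": "ObjectCreated:Put",
         "s3": {"bucket": {"name": "ingest-bucket"},
                "object": {"key": "client1/userdata/part-0001.parquet"}}},
        {"eventName": "ObjectRemoved:Delete",
         "s3": {"bucket": {"name": "ingest-bucket"},
                "object": {"key": "client1/userdata/old.parquet"}}},
        {"eventName": "ObjectCreated:CompleteMultipartUpload",
         "s3": {"bucket": {"name": "ingest-bucket"},
                "object": {"key": "client2/revenuedata/day%3D1.parquet"}}},
    ]
}
routed = route_records(sample_event)
```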
There are a couple of key technical points to think about. First, consider permissions: it's easier to write clear access policies if the client identifier comes first, because each client's access can be scoped to a single prefix. Second, remember that S3 listing works on prefixes. If Databricks needs to process all userdata at once, it's simpler when the topic comes first; if processing is per client, it's simpler when the client ID comes first.
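To make the permissions point concrete, here is a sketch of an IAM policy document that limits a client to its own prefix in a client-first layout. The bucket name, client ID, and statement IDs are hypothetical, and which actions a client actually needs is an assumption.

```python
import json


def client_write_policy(bucket: str, client_id: str) -> dict:
    """Build an IAM policy allowing a client to write and list only its own prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "WriteOwnPrefix",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                # Object-level permission, scoped to keys under the client's prefix.
                "Resource": f"arn:aws:s3:::{bucket}/{client_id}/*",
            },
            {
                "Sid": "ListOwnPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                # ListBucket is a bucket-level action, so the prefix goes in a condition.
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{client_id}/*"]}},
            },
        ],
    }


policy = client_write_policy("ingest-bucket", "client1")
policy_json = json.dumps(policy, indent=2)
```

With a topic-first layout, the same restriction would need either one statement per topic or a wildcard in the middle of the resource ARN, which is harder to read and audit.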
We plan to load data by client initially. These clients are actually branches within our company.
If clients will be writing directly to the bucket, I recommend structuring it with a client prefix for easier management. But I advise against giving them direct access to Databricks. Creating an ingest bucket where you control data movement into your processing bucket is safer to prevent issues with bad data formats. If you're primarily dealing with one Databricks consumer, a single bucket approach might reduce the management overhead of adding new schemas whenever a new client joins.
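As a sketch of that controlled hand-off, the check below decides where an uploaded object should go when it is promoted from the ingest bucket. The topic names, key pattern, `quarantine/` prefix, and the topic-first target layout with a `client=` directory are all assumptions for illustration. The format check relies only on the fact that a Parquet file starts and ends with the 4-byte magic `PAR1`.

```python
import re

PARQUET_MAGIC = b"PAR1"
# Assumed topic names, based on the topics mentioned in the question.
ALLOWED_TOPICS = {"userdata", "revenuedata", "marketingdata"}
# Assumed ingest layout: client/topic/file.parquet
KEY_PATTERN = re.compile(
    r"^(?P<client>[a-z0-9_-]+)/(?P<topic>[a-z0-9_-]+)/(?P<name>[^/]+\.parquet)$"
)


def looks_like_parquet(data: bytes) -> bool:
    """Cheap sanity check: Parquet magic bytes at both ends of the file."""
    # In practice only the first and last 4 bytes need to be fetched,
    # for example with ranged reads.
    return (
        len(data) >= 12
        and data[:4] == PARQUET_MAGIC
        and data[-4:] == PARQUET_MAGIC
    )


def promotion_target(key: str, data: bytes) -> str:
    """Return the key in the processing bucket, or a quarantine key if invalid."""
    match = KEY_PATTERN.match(key)
    if (
        match is None
        or match["topic"] not in ALLOWED_TOPICS
        or not looks_like_parquet(data)
    ):
        return f"quarantine/{key}"
    # Client-first on ingest, topic-first for processing.
    return f"{match['topic']}/client={match['client']}/{match['name']}"


# Stand-in bytes that only satisfy the magic check, not a real Parquet file.
fake_parquet = PARQUET_MAGIC + b"\x00" * 8 + PARQUET_MAGIC
target = promotion_target("client1/userdata/a.parquet", fake_parquet)
```

One consequence of this design is that the two layouts stop being an either/or choice: clients write under their own prefix, and the copy step can rearrange keys into whatever layout suits the Databricks job.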
Yes, they'll have direct access as these are branches of our company. It will be an ingest bucket for sure, and I'm leaning toward that single bucket model.

That's definitely the way to go! As long as you can manage your clients, the unique buckets should handle any issues.