Should I Structure S3 Bucket Names by Client or Topic?

Asked By CuriousExplorer92 On

I have an S3 bucket where different clients will be dropping parquet files for various topics like userdata, revenue data, and marketing data. I'm trying to decide on a naming convention for the key prefixes inside the bucket. Should I go with a structure that prioritizes the client first, like:
* bucket/client1/userdata
* bucket/client2/userdata
* bucket/client1/revenuedata

Or, should it be structured to prioritize the topic, like:
* bucket/userdata/client1
* bucket/userdata/client2

My main concern is about the long-term management of these files since the schemas of the topics can differ, with some files having extra fields while others lack some. We plan on ingesting this data into Databricks daily.
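To make the schema concern concrete, here is a minimal pure-Python sketch of padding records from different clients to the union of all observed fields. The function name `align_to_union`, the `_client` column, and the sample field names are assumptions for illustration only. On Databricks the same idea is usually handled by Spark's parquet `mergeSchema` read option rather than by hand.

```python
from typing import Any


def align_to_union(
    records_by_client: dict[str, list[dict[str, Any]]],
) -> tuple[list[str], list[dict[str, Any]]]:
    """Pad every record to the union of all fields seen across clients."""
    # Collect column names in first-seen order.
    columns: list[str] = []
    for records in records_by_client.values():
        for record in records:
            for name in record:
                if name not in columns:
                    columns.append(name)

    # Fill fields a client did not send with None and remember the source.
    aligned: list[dict[str, Any]] = []
    for client, records in records_by_client.items():
        for record in records:
            row = {name: record.get(name) for name in columns}
            row["_client"] = client  # hypothetical provenance column
            aligned.append(row)
    return columns, aligned


columns, rows = align_to_union(
    {
        "client1": [{"id": 1, "email": "a@example.com"}],
        "client2": [{"id": 2, "country": "DE"}],
    }
)
print(columns)
for row in rows:
    print(row)
```

Either prefix layout works with this kind of alignment, since the client can be recovered from the key regardless of where it sits in the path.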

5 Answers

Answered By FutureFocused On

Don't forget to consider your exit strategy. You'll need to prove that data has been deleted at some point. Think about whether the data is user-specific or client-specific. Setting lifecycle rules can really help with that.
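As a rough illustration of the lifecycle-rule idea, the sketch below builds a lifecycle configuration with one expiration rule per client prefix. The prefixes, retention periods, and the bucket name in the comment are assumptions, not values from this thread. The resulting dictionary has the shape that boto3's `put_bucket_lifecycle_configuration` accepts.

```python
import json


def expiration_rule(prefix: str, days: int) -> dict:
    """One lifecycle rule that expires objects under a prefix after `days`."""
    rule_id = "expire-" + prefix.strip("/").replace("/", "-")
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }


def lifecycle_configuration(retention_days_by_prefix: dict[str, int]) -> dict:
    """Combine per-prefix retention periods into one configuration."""
    return {
        "Rules": [
            expiration_rule(prefix, days)
            for prefix, days in retention_days_by_prefix.items()
        ]
    }


# Hypothetical retention periods per client prefix.
config = lifecycle_configuration({"client1/": 365, "client2/": 90})
print(json.dumps(config, indent=2))

# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-ingest-bucket",  # assumed bucket name
#       LifecycleConfiguration=config,
#   )
```

A client-first layout keeps this to one rule per client. If versioning is enabled on the bucket, noncurrent versions need their own expiration rule as well.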

Answered By DataGeek42 On

If you're planning for long-term storage, a dedicated bucket per client is a solid choice. It simplifies cost tracking, because S3 cost allocation tags are applied at the bucket level rather than per prefix. Plus, unless you expect something like a million clients, you're unlikely to hit the per-account bucket limit, though the default quota is lower and may need a quota increase request.
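A bucket-per-client setup could be sketched roughly like this: derive a valid bucket name from the client ID and attach a tag for cost tracking. The `org` prefix, the `-ingest-` naming pattern, and the `client` tag key are assumptions for illustration. The tagging dictionary matches the shape expected by boto3's `put_bucket_tagging`.

```python
import re


def client_bucket_name(org: str, client_id: str) -> str:
    """Build a bucket name from an org prefix and client ID (assumed pattern)."""
    name = f"{org}-ingest-{client_id}".lower()
    # Replace anything other than lowercase letters, digits, and hyphens.
    name = re.sub(r"[^a-z0-9-]", "-", name)
    # Names must be 3-63 characters and start and end with a letter or digit.
    if not re.fullmatch(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]", name):
        raise ValueError(f"invalid bucket name: {name!r}")
    return name


def client_tagging(client_id: str) -> dict:
    """Tag set identifying the client; 'client' is a hypothetical tag key."""
    return {"TagSet": [{"Key": "client", "Value": client_id}]}


bucket = client_bucket_name("acme", "Client_1")
print(bucket)
print(client_tagging("Client_1"))

# With boto3 available, the tags could be attached with:
#   boto3.client("s3").put_bucket_tagging(
#       Bucket=bucket, Tagging=client_tagging("Client_1")
#   )
```

The tag only shows up in cost reports once it has been activated as a cost allocation tag in the billing settings.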

ClientFirst99 -

That's definitely the way to go! As long as you can manage your clients, the unique buckets should handle any issues.

TaggingMaster -

I totally agree, it makes everything easier for access and auditing.

Answered By DataFlowMaster On

I have a similar setup where data flows like this: ingress facade -> bucket -> queue -> consumer. S3's object-created notifications go to the queue, which lets me fan the events out to consumers.
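The consumer end of that pipeline could look roughly like the sketch below, which pulls the bucket and key out of an S3 event notification delivered as a queue message body. The function name `created_objects` and the sample message are assumptions. The record layout follows the S3 event notification format, where object keys arrive URL-encoded.

```python
import json
from urllib.parse import unquote_plus


def created_objects(message_body: str) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs for object-created records in a message."""
    event = json.loads(message_body)
    found: list[tuple[str, str]] = []
    # Messages without "Records" (for example S3 test events) yield nothing.
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        s3 = record["s3"]
        # Keys are URL-encoded in notifications, so decode them first.
        found.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return found


# Hypothetical message body for illustration.
sample = json.dumps(
    {
        "Records": [
            {
                "eventName": "ObjectCreated:Put",
                "s3": {
                    "bucket": {"name": "ingest-bucket"},
                    "object": {"key": "client1/userdata/day+one.parquet"},
                },
            }
        ]
    }
)
print(created_objects(sample))
```

With a client-first key, the consumer can split the decoded key on `/` to route the file by client and then by topic.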

Answered By TechieThoughts On

There are a couple of key technical points to think about. First, permissions: access policies are simpler to write when the client identifier comes first. Second, remember that S3 lists and filters objects by key prefix. If Databricks needs to process all userdata at once, a topic-first prefix makes that simpler; if processing is per client, a client-first prefix does.
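To show why client-first prefixes keep permissions simple, here is a sketch of an IAM policy document that lets one client write and list only under its own prefix. The bucket name, the statement IDs, and the choice of allowed actions are assumptions for illustration.

```python
import json


def client_write_policy(bucket: str, client_id: str) -> dict:
    """IAM policy limiting a client to its own prefix (client-first layout)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "WriteOwnPrefix",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                # One resource pattern covers every topic for this client.
                "Resource": f"arn:aws:s3:::{bucket}/{client_id}/*",
            },
            {
                "Sid": "ListOwnPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                # Listing is restricted to keys under the client's prefix.
                "Condition": {"StringLike": {"s3:prefix": [f"{client_id}/*"]}},
            },
        ],
    }


# Hypothetical bucket and client names.
print(json.dumps(client_write_policy("ingest-bucket", "client1"), indent=2))
```

With a topic-first layout, the same restriction would need a resource pattern or statement for each topic the client writes to, which is more to maintain as topics are added.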

IngestOnly -

We plan to load data by client initially. These clients are actually branches within our company.

Answered By SmartStrategist On

If clients will be writing directly to the bucket, I recommend structuring it with a client prefix for easier management. But I advise against giving them direct access to Databricks. Creating an ingest bucket where you control data movement into your processing bucket is safer to prevent issues with bad data formats. If you're primarily dealing with one Databricks consumer, a single bucket approach might reduce the management overhead of adding new schemas whenever a new client joins.
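The controlled move from the ingest bucket to the processing bucket could be sketched as a key-mapping step that rejects unexpected files. Everything specific here is an assumption: the topic names are taken loosely from the question, the ingest layout is assumed to be client-first, and the `client=` partition-style folder in the processing key is just one possible target layout.

```python
# Topic names assumed from the question; adjust to the real list.
ALLOWED_TOPICS = {"userdata", "revenuedata", "marketingdata"}


def processing_key(ingest_key: str) -> str:
    """Map 'client/topic/file.parquet' to 'topic/client=<client>/file.parquet'."""
    parts = ingest_key.split("/")
    if len(parts) < 3 or not all(parts):
        raise ValueError(f"unexpected key layout: {ingest_key!r}")
    client, topic, *rest = parts
    filename = "/".join(rest)
    if topic not in ALLOWED_TOPICS:
        raise ValueError(f"unknown topic: {topic!r}")
    if not filename.endswith(".parquet"):
        raise ValueError(f"not a parquet file: {filename!r}")
    return f"{topic}/client={client}/{filename}"


print(processing_key("client1/userdata/2024-01-01.parquet"))
```

Keys that fail the checks never reach the processing bucket, so bad files stay in the ingest bucket where they can be inspected. The ingest side stays client-first for permissions while the processing side can be topic-first for reads.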

BranchingOut -

Yes, they'll have direct access as these are branches of our company. It will be an ingest bucket for sure, and I'm leaning toward that single bucket model.
