Best Practices for Querying Data in S3 Tables

Asked By SkyBluePenguin92

Hey everyone! I'm diving into S3 Tables to query data directly in S3 without moving it into Redshift or another traditional data warehouse. I plan to use it alongside Athena and Glue, but I have a few questions:

1. What file formats do you all recommend for S3 Tables in terms of performance and cost? Specifically, I'm considering options like Parquet, ORC, and CSV.
2. Has anyone here tried using Lake Formation for table-level access control with S3 Tables?
3. Do you have any tips for optimizing query performance and cost-efficiency when working with large datasets?

I'd really appreciate hearing about your experiences and suggestions. Thanks a lot!

4 Answers

Answered By QueryMaster88

Absolutely go for Parquet or ORC. Columnar formats drastically reduce storage and scan costs, while CSV charges will really add up when you're querying with Athena. If you do use Lake Formation with the Glue Catalog for governance, keep your permissions tightly managed or it can spiral into a debugging nightmare. To speed up your queries, partition by the fields you filter on most frequently and aim for file sizes between 128 and 512 MB; too many small files can hurt performance.
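
If it helps, here's roughly what that looks like as Athena DDL. This is just a sketch; the table, columns, and bucket are all made up, so adapt it to your own schema:

-- Partitioned Parquet table; partition by the column you filter on most.
CREATE EXTERNAL TABLE sales_events (
  event_id    string,
  customer_id string,
  amount      double
)
PARTITIONED BY (event_date string)
STORED AS PARQUET
LOCATION 's3://my-analytics-bucket/sales_events/';

-- Register partitions already written under that prefix.
MSCK REPAIR TABLE sales_events;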

Answered By CloudGuru99

I advise steering clear of Lake Formation if you don't have a ton of tables; it can be overly complex and buggy compared to just using IAM policies for access management. Parquet works best unless you need to perform updates or deletes, in which case look into Iceberg, which supports row-level changes.
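
For reference, here's a minimal sketch of the Iceberg route in Athena. Table and bucket names are hypothetical:

-- The table_type property is what makes this an Iceberg table.
CREATE TABLE sales_events_iceberg (
  event_id    string,
  customer_id string,
  amount      double,
  event_date  date
)
PARTITIONED BY (event_date)
LOCATION 's3://my-analytics-bucket/iceberg/sales_events/'
TBLPROPERTIES ('table_type' = 'ICEBERG');

-- Row-level changes work on Iceberg tables, unlike plain Parquet ones:
UPDATE sales_events_iceberg SET amount = 0 WHERE event_id = 'abc123';
DELETE FROM sales_events_iceberg WHERE event_date < DATE '2020-01-01';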

Answered By TestingTechie

I haven’t personally tested it yet, but I think Lake Formation only integrates with the Glue Catalog. So, it might not be compatible with S3 Tables. I could be mistaken if there have been recent updates, though!

Answered By DataWhiz123

From my experience using S3 Tables with Athena, Parquet is definitely the best choice: queries run faster and scans cost less than with CSV. ORC is fine too, but I'd stick with Parquet whenever possible. To keep performance and costs in check, partition your data wisely (for example s3://bucket/year/month/day) and use compression such as gzip or snappy. Also, rein in development queries: Athena workgroups can enforce per-query data-scan limits, which stops ad-hoc experimentation from running up the bill.
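
One concrete trick: if your raw data lands as CSV, a one-off CTAS statement in Athena will rewrite it as partitioned, snappy-compressed Parquet. Sketch below with made-up table and bucket names (note the partition column has to come last in the SELECT):

CREATE TABLE sales_events_parquet
WITH (
  format = 'PARQUET',
  write_compression = 'SNAPPY',
  external_location = 's3://my-analytics-bucket/curated/sales_events/',
  partitioned_by = ARRAY['event_date']
) AS
SELECT event_id, customer_id, amount, event_date
FROM sales_events_csv;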

CuriousCoder09

We handle about 1 GB of data daily, and most queries need all the columns. Just curious if that changes your recommendation?

SmartPartitioner

Those year/month/day partitions aren't great for date-range filters: a range that crosses a month boundary turns into awkward OR logic across three separate partition keys. A better option is Hive-style partitioning on a single key like event_date=YYYY-MM-DD, so a plain BETWEEN can prune partitions.
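
To make that concrete (hypothetical table, assuming data laid out under s3://bucket/events/event_date=YYYY-MM-DD/):

-- Athena prunes partitions straight off the key; ISO dates compare correctly as strings.
SELECT event_id, amount
FROM sales_events
WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31';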
