Hey everyone! I'm diving into using S3 Tables to query data directly in S3 without the need to transfer it to Redshift or a traditional data warehouse. I plan to use it alongside Athena and Glue, but I have a couple of questions:
1. What file formats do you all recommend for S3 Tables in terms of performance and cost? Specifically, I'm considering options like Parquet, ORC, and CSV.
2. Has anyone here tried using Lake Formation for table-level access control with S3 Tables?
3. Do you have any tips for optimizing query performance and cost-efficiency when working with large datasets?
I'd really appreciate hearing about your experiences and suggestions. Thanks a lot!
4 Answers
Absolutely go for Parquet or ORC. Columnar formats drastically reduce storage and scan costs, while CSV scan charges will really add up when you're using Athena. If you do use Lake Formation with the Glue Catalog for governance, make sure your permissions are well managed or it can spiral into a debugging nightmare. To speed up queries, partition on the fields you filter on most frequently and aim for file sizes between 128 and 512 MB; too many small files hurt performance.
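One common way to fix the small-files problem is an Athena CTAS query that rewrites a table as partitioned, compressed Parquet. Here's a rough sketch that just builds the SQL string; the table and bucket names (`events_raw`, `events_parquet`, `my-bucket`) are made up, so swap in your own:

```python
def build_compaction_ctas(source_table: str, target_table: str,
                          output_location: str, partition_col: str) -> str:
    """Build an Athena CTAS statement that rewrites a table as
    partitioned, Snappy-compressed Parquet (fewer, larger files)."""
    return (
        f"CREATE TABLE {target_table} "
        f"WITH (format = 'PARQUET', "
        f"write_compression = 'SNAPPY', "
        f"external_location = '{output_location}', "
        f"partitioned_by = ARRAY['{partition_col}']) AS "
        f"SELECT * FROM {source_table}"
    )

sql = build_compaction_ctas("events_raw", "events_parquet",
                            "s3://my-bucket/events_parquet/", "event_date")
print(sql)
```

You'd then run the resulting statement through the Athena console or `start_query_execution` in boto3. Not a drop-in solution, just the shape of the approach.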
I advise steering clear of Lake Formation if you don’t have a ton of tables. It can be overly complex and buggy compared to just using IAM policies for access management. Parquet works best unless you need to perform updates or deletes, in which case, consider looking into Iceberg for better support.
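To illustrate the plain-IAM route: a table-scoped read policy only needs a Glue statement and an S3 statement. This is a minimal sketch; the account ID, region, database, table, and bucket names (`analytics_db`, `events`, `my-data-bucket`) are all hypothetical placeholders:

```python
import json

# Minimal read-only access to one Glue table and its underlying S3 data.
# All ARNs/names below are placeholders for illustration.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GlueTableRead",
            "Effect": "Allow",
            "Action": ["glue:GetTable", "glue:GetPartitions"],
            "Resource": [
                "arn:aws:glue:us-east-1:123456789012:catalog",
                "arn:aws:glue:us-east-1:123456789012:database/analytics_db",
                "arn:aws:glue:us-east-1:123456789012:table/analytics_db/events",
            ],
        },
        {
            "Sid": "S3DataRead",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-data-bucket",
                "arn:aws:s3:::my-data-bucket/events/*",
            ],
        },
    ],
}
print(json.dumps(policy, indent=2))
```

With a handful of tables this stays manageable; once you have hundreds of tables and per-column rules, that's when Lake Formation starts earning its complexity.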
I haven’t personally tested it, but I believe Lake Formation only integrates through the Glue Catalog, so it might not be compatible with S3 Tables. I could be mistaken if there have been recent updates, though!
From my experience using S3 Tables with Athena, Parquet is definitely the best choice: queries are faster and cheaper than with CSV. ORC is fine, but I'd still stick with Parquet whenever possible. To keep performance and costs in check, partition your data wisely (e.g. s3://bucket/year/month/day) and use a compression codec like gzip or Snappy. Also keep development queries under control; Athena workgroups let you set data-scan limits for exactly that purpose.
Those year/month/day partitions aren't great for date-range filters. A better option could be using something like event_date=YYYY-MM-DD for Hive-style partitioning.
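The difference is just the key layout. A quick sketch of building Hive-style prefixes (bucket and prefix names are hypothetical), which Athena can then prune with a plain `WHERE event_date BETWEEN ...` filter:

```python
from datetime import date

def partition_prefix(bucket: str, table_prefix: str, event_date: date) -> str:
    """Hive-style partition prefix: key=value in the path, so Athena
    registers event_date as a queryable partition column."""
    return f"s3://{bucket}/{table_prefix}/event_date={event_date.isoformat()}/"

print(partition_prefix("my-bucket", "events", date(2024, 3, 15)))
# → s3://my-bucket/events/event_date=2024-03-15/
```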

We handle about 1 GB of data daily, and most queries need all the columns. Just curious if that changes your recommendation?