I'm currently leading a pilot project to set up an enterprise Data Lakehouse on AWS for a university. I chose the Medallion architecture, which includes Bronze for raw data, Silver for clean validated data, and Gold for modeled BI data, to ensure data quality and scalability. I'm looking for recommendations on AWS services to support this flow. I'm considering the AWS Glue Data Catalog for central indexing on S3, using Amazon Athena for SQL queries on the Gold data, and Amazon QuickSight for visualization. However, I'm facing challenges with ingestion, storage, and transformation. My current database is on RDS, but I'm wondering what the best options are. I'm also interested in any courses or tutorials that could help me out. Thanks!
2 Answers
If you're managing a lot of users without AWS console access, Sagemaker Unified Studio can simplify collaboration. It offers a more user-friendly interface for those people, which could streamline some of your processes.
Choosing the right services really hinges on your data type, refresh rates, and overall volume. For ETL, I highly recommend using AWS Glue. For data warehousing, you can't go wrong with Redshift for your Gold layer. For your Bronze and Silver layers, consider Iceberg Tables on S3 for efficient data lake storage. For cataloging, look into DataZone or Sagemaker Catalog, and for BI, QuickSight is a solid choice. If your dataset is relatively small or well-partitioned, Athena can be a great option for querying as well.

Thanks for your insights! The data I'm dealing with is extensive, covering everything from student records to classroom info, so I'm considering batch updates in the early mornings. Cost optimization is a priority, hence keeping the original format in S3 for the Bronze layer.