Need Advice on AWS Data Warehouse Setup with On-Prem Databases

0
3
Asked By DataNinja42 On

Hey everyone! I'm currently working on designing a data architecture and could really use some insights, especially from those who have experience with hybrid setups involving both on-premise and AWS data warehousing. Here's the situation: we operate a SaaS microservices platform on-prem using mainly PostgreSQL, with a sprinkle of MySQL and MongoDB. Each service has its own database per tenant, leading to a lot of small-to-medium databases. Currently, our total data is around 2.8 TB and we're adding about 600 GB each year.

We're aiming to establish a data warehouse on AWS to facilitate near real-time dashboards (a lag of 5 to 10 minutes is acceptable), perform historical trend analysis, and support multi-tenant analytics use cases.

### Current Design Considerations
I'm considering an architecture that involves:
1. Change Data Capture (CDC) from our on-prem PostgreSQL using AWS DMS.
2. A staging layer in Aurora PostgreSQL to combine all databases into one and maintain the production schema — though I'm wondering if it would be better to use S3 for staging before moving to Redshift, especially since Redshift isn't ideal for frequent inserts.
3. The final analytics layer:
* Should I stick with Aurora PostgreSQL, or is Redshift a more suitable choice?
* I'm also looking at using Amazon QuickSight for visualizations.

### Requirements
1. Near real-time change capture (5-10 minutes).
2. Budget-friendly.
3. Compatibility with dashboarding tools like QuickSight.
4. Scalability for new tenants/services in the future.

### Questions for You
1. Is anyone using a similar hybrid setup? What's worked for you, and what hasn't?
2. Any thoughts on using Aurora PostgreSQL versus S3 as a landing zone?
3. Do you think Redshift is overkill for our scale, or does it offer advantages worth considering?
4. Are there any potential pitfalls with using AWS DMS CDC pipelines at our scale?
5. Suggestions for managing real-time and historical data flows effectively (like using materialized views, Lambda refreshes, etc.)?

1 Answer

Answered By TechGuru99 On

You might want to check out Debezium for your CDC needs; it captures changes and writes them into S3 as Iceberg files which can be super cost-effective compared to RDS. For your dashboards, QuickSight works well with Athena queries on those Iceberg files!

From my experience with AWS, I'd steer clear of Redshift—it has its flaws, along with Lake Formation which lacks sufficient support. EMR doesn't play nicely with Glue catalog and Lake Formation either. Instead, if you can, I'd suggest going for a Snowflake + DBT + Power BI combo.

SkepticalUser34 -

What exactly do you find flawed about Redshift?

QueryMaster21 -

Why go with Debezium instead of AWS DMS?

Related Questions

LEAVE A REPLY

Please enter your comment!
Please enter your name here

This site uses Akismet to reduce spam. Learn how your comment data is processed.