I'm setting up Change Data Capture (CDC) from my on-prem SQL Server database to S3 using AWS Database Migration Service (DMS), and I've run into some odd behavior. Has anyone else seen this? During the full load phase, which runs before CDC kicks in, DMS generates multiple `LOAD*.parquet` files, each with roughly the same row count. On top of that, it writes timestamped files containing the transactions that happened during replication, so I get duplicate rows when I query the data from Athena. AWS support tells me this is intentional, but that doesn't seem right to me: DMS also targets Redshift, which doesn't enforce constraints the way traditional databases do, so nothing downstream would reject the duplicates. Some updates also appear to be missing from the timestamped files. Is this a common problem?
1 Answer
Yeah, DMS has its share of headaches. It promises a smooth ride and then throws unexpected bumps at you. They’ve started documenting the pitfalls better, but it’s still pretty frustrating. For example, if your source table is being written to heavily during the full load, DMS can end up writing duplicate rows to S3, either in the load files or in the cached changes it applies afterwards. That just doesn’t seem reliable to me.
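If you have to live with the duplicates, one workaround is to collapse them at read time: keep only the newest row per primary key and drop keys whose newest row is a delete. Here's a minimal sketch of that idea in plain Python. It assumes every row carries a primary key, a sortable timestamp, and an operation flag (`I`/`U`/`D`); the column names `id`, `dms_ts` and `Op` and the function `latest_per_key` are placeholders I made up for illustration, so swap in whatever your files actually contain.

```python
def latest_per_key(rows, key="id", ts="dms_ts", op="Op"):
    """Collapse full-load and CDC rows to one current row per key.

    Column names are assumptions for this sketch, not DMS guarantees.
    Rows with equal timestamps are ordered by their position in `rows`.
    """
    latest = {}
    for seq, row in enumerate(rows):
        k = row[key]
        candidate = (row[ts], seq)  # later timestamp wins; input order breaks ties
        if k not in latest or candidate >= latest[k][0]:
            latest[k] = (candidate, row)
    # Drop keys whose most recent change is a delete.
    return [row for _, row in latest.values() if row.get(op) != "D"]


rows = [
    {"id": 1, "name": "a", "dms_ts": "2024-01-01 00:00:00", "Op": "I"},
    {"id": 1, "name": "a", "dms_ts": "2024-01-01 00:00:00", "Op": "I"},  # full-load duplicate
    {"id": 1, "name": "b", "dms_ts": "2024-01-01 00:05:00", "Op": "U"},
    {"id": 2, "name": "c", "dms_ts": "2024-01-01 00:00:00", "Op": "I"},
    {"id": 2, "name": "c", "dms_ts": "2024-01-01 00:06:00", "Op": "D"},
]
current = latest_per_key(rows)
print(current)
```

The same "latest row per key" logic can be written as a window function in an Athena view, so queries never see the raw duplicates. Note that this only hides duplicates; it can't bring back updates that never made it into the files.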
It's such a letdown! I found some fine print in the documentation that warns about duplicates, but when you're actually using it, you don't notice them until it's too late.