Hey everyone! We're in the process of migrating our workload to AWS from an on-prem Cloudera setup. Right now, we're using Sqoop to load our RDBMS data to HDFS every day. I'm looking for a similar tool in the AWS ecosystem that can help with this task.
Ideally, I'd like to avoid using binlog CDC because it just seems too complicated for our needs. The tables I want to load have a clear updated_date field, and we don't delete records. Any suggestions?
3 Answers
AWS Glue could be a good fit. It connects to JDBC-compatible databases such as MySQL, and a job can pull only the rows where updated_date is greater than the last loaded time. It can write the output to S3 as Parquet, ORC, or CSV, so S3 takes the place that HDFS has in your current setup. You can schedule the jobs to run daily too!
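To make the "updated_date greater than the last loaded time" idea concrete, here is a minimal sketch of the watermark pattern. It uses an in-memory SQLite database only so it runs anywhere; the `orders` table, its columns, and the helper names are all made up for illustration, and this is not Glue's own API. In a real job the same query would go to your source database over JDBC, and the watermark would be stored somewhere durable between runs.

```python
import sqlite3


def make_demo_db():
    """Build a throwaway in-memory table (hypothetical schema) for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT, updated_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            (1, "a", "2024-01-01 00:00:00"),
            (2, "b", "2024-01-02 08:30:00"),
            (3, "c", "2024-01-03 09:00:00"),
        ],
    )
    return conn


def fetch_incremental(conn, table, last_loaded):
    """Return (rows changed after last_loaded, new watermark)."""
    # Table names cannot be bound as query parameters, so reject anything
    # that is not a plain identifier before interpolating it.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    cur = conn.execute(
        f"SELECT id, payload, updated_date FROM {table} "
        "WHERE updated_date > ? ORDER BY updated_date",
        (last_loaded,),
    )
    rows = cur.fetchall()
    # The newest updated_date seen becomes the watermark for the next run;
    # if nothing changed, keep the old one.
    new_watermark = rows[-1][2] if rows else last_loaded
    return rows, new_watermark


if __name__ == "__main__":
    conn = make_demo_db()
    rows, watermark = fetch_incremental(conn, "orders", "2024-01-01 23:59:59")
    print(rows, watermark)
```

One thing to watch with a strict `>` comparison: a row committed later with the same timestamp as the stored watermark would be skipped, so some people re-read a small overlap window and de-duplicate on the primary key.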
You should check out AWS Database Migration Service (DMS). It's built for moving data out of relational databases, and it supports S3 as a target.
Here are some useful links to get you started on AWS databases:
- [AWS Database Products](https://aws.amazon.com/products/databases/)
- [Amazon RDS](https://aws.amazon.com/rds/)
- [Amazon DynamoDB](https://aws.amazon.com/dynamodb/)
- [Amazon Aurora](https://aws.amazon.com/aurora/)
- [Amazon Redshift](https://aws.amazon.com/redshift/)
- [Amazon DocumentDB](https://aws.amazon.com/documentdb/)
- [Amazon Neptune](https://aws.amazon.com/neptune/)