I'm looking to migrate a massive dataset of roughly 40 TB, spread across 80 Elasticsearch indices with a total of 10 to 14 billion documents, to Amazon S3. The data will be accessed frequently after the move, so I need an approach that is both safe and fast, with robust error handling and minimal downtime. I tried writing a manual Python script, but it doesn't seem efficient or reliable enough for a project of this scale. Can anyone share effective methods or best practices for this migration? Also, roughly how long might a transfer of this size take?
3 Answers
I'm curious, what prompted the migration? Are there specific challenges you're encountering?
Snapshotting your data directly to S3 can streamline the process considerably. I've handled about 5 TB this way and it worked well for us. Just make sure the snapshot process itself is tuned (throttling and chunking settings matter at this scale) and you should be in good shape.
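For reference, the snapshot setup is just a couple of REST calls. Here's a rough sketch using `requests`; the endpoint, repository, and bucket names are placeholders, and the cluster needs S3 repository support (the `repository-s3` plugin on self-managed clusters) with AWS credentials configured on every node:

```python
import requests

ES = "http://localhost:9200"  # assumed cluster endpoint

# Register an S3 snapshot repository. Bucket and repo names are placeholders.
repo_body = {
    "type": "s3",
    "settings": {
        "bucket": "my-es-snapshots",
        "base_path": "cluster-backups",
        "compress": True,
    },
}
r = requests.put(f"{ES}/_snapshot/my_s3_repo", json=repo_body)
r.raise_for_status()

# Kick off a snapshot of all indices. At 40 TB you don't want
# wait_for_completion=true blocking the request; poll instead.
r = requests.put(
    f"{ES}/_snapshot/my_s3_repo/migration-snap-1",
    json={"indices": "*", "include_global_state": False},
)
r.raise_for_status()

# Poll progress until the state reaches SUCCESS.
status = requests.get(f"{ES}/_snapshot/my_s3_repo/migration-snap-1/_status").json()
print(status["snapshots"][0]["state"])
```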
One effective strategy is to set up your new storage layer with the right schema up front, using a table or columnar format like Iceberg or Parquet. First, halt writes to your Elasticsearch cluster to guarantee data consistency, then snapshot your data to S3. From there, you can migrate the data into the new format with a script that iterates over the documents and converts them as needed. This approach is generally reliable for large migrations.
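One note on the conversion step: the snapshot format is internal to Elasticsearch, so in practice you typically restore the snapshot (or read from the paused cluster) and stream documents out rather than parsing the snapshot files themselves. A minimal sketch of that export using the Python client and pyarrow, assuming a reasonably consistent document schema per index (endpoint, index name, and batch size are placeholders):

```python
from elasticsearch import Elasticsearch, helpers
import pyarrow as pa
import pyarrow.parquet as pq

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

def export_index_to_parquet(index: str, out_path: str, batch_size: int = 10_000):
    """Stream all documents from one index into a Parquet file."""
    writer = None
    batch = []
    # helpers.scan wraps the scroll API and handles pagination for us.
    for hit in helpers.scan(es, index=index, size=batch_size, scroll="5m"):
        doc = hit["_source"]
        doc["_id"] = hit["_id"]  # keep the record ID for later lookups
        batch.append(doc)
        if len(batch) >= batch_size:
            table = pa.Table.from_pylist(batch)
            if writer is None:
                # The first batch fixes the schema; heterogeneous documents
                # would need explicit schema handling instead.
                writer = pq.ParquetWriter(out_path, table.schema)
            writer.write_table(table)
            batch = []
    if batch:
        table = pa.Table.from_pylist(batch)
        if writer is None:
            writer = pq.ParquetWriter(out_path, table.schema)
        writer.write_table(table)
    if writer:
        writer.close()

export_index_to_parquet("logs-000001", "logs-000001.parquet")
```

You would run one export per index (or per shard, in parallel) to keep individual Parquet files at a manageable size.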

We're aiming to cut down on unnecessary storage in Elasticsearch. Right now, too many JSON fields are being indexed, inflating our storage needs. Our plan is to keep only essential fields in Elasticsearch and store full documents in S3 instead. This way, when users need complete records, we can fetch them directly from S3 using the record ID.
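For the lookup path, we're imagining something along these lines with boto3; the bucket name and key layout here are placeholders, and the real layout (e.g. hash prefixes for partitioning) is still an open design choice:

```python
import json
import boto3

s3 = boto3.client("s3")

BUCKET = "my-full-documents"  # hypothetical bucket

def get_full_record(record_id: str) -> dict:
    """Fetch the complete document from S3 using the ID stored in Elasticsearch."""
    # Assumes each document was written as an individual JSON object keyed by ID.
    key = f"records/{record_id}.json"
    resp = s3.get_object(Bucket=BUCKET, Key=key)
    return json.loads(resp["Body"].read())
```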