I'm looking for suggestions to improve the performance of a bulk data import system I've built. Right now, it reads data from Excel files and performs insert and update operations across multiple tables in MySQL. The import logic itself works fine, but the performance needs help. It's taking over 6 hours to process 70,000 rows, which is only a fraction of the total data I need to handle, as each file can have up to 500,000 rows. I've already imported around 4 million records, with 2-3 million more on the way. The tables I'm dealing with are fairly large, and I have multiple validations happening too. I've tried batching, chunking reads, and caching, but I'm stuck. How do you handle bulk imports efficiently? I'm using Laravel 10 and MySQL for this.
6 Answers
If your insertion is slowing down as your tables get larger, it could mean that your queries aren't using indexes properly, especially if you have constraints. Check your query plans to see if indexes are being utilized as needed. An alternative approach some take is to drop indexes and constraints during the bulk insert, then recreate them afterward. This can significantly reduce the load during import, but keep in mind it does risk data integrity while the indexes/constraints are gone.
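As a rough sketch of the drop-and-recreate approach (the `order_lines` table, `sku` column, and index name are hypothetical, adjust to your schema), the raw statements from Laravel might look something like this:

```php
use Illuminate\Support\Facades\DB;

// Drop the secondary index so MySQL doesn't maintain it on every insert.
DB::statement('ALTER TABLE order_lines DROP INDEX order_lines_sku_index');

// Relax per-row checks for this connection's session only.
DB::statement('SET foreign_key_checks = 0');
DB::statement('SET unique_checks = 0');

// ... run the bulk insert here ...

// Restore checks and rebuild the index once the data is in.
DB::statement('SET unique_checks = 1');
DB::statement('SET foreign_key_checks = 1');
DB::statement('ALTER TABLE order_lines ADD INDEX order_lines_sku_index (sku)');
```

Rebuilding one index over the full table at the end is usually far cheaper than updating it row by row, but as noted, nothing is enforcing uniqueness or referential integrity while the checks are off, so the imported data has to be trusted or validated in the application.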
It sounds like you're dealing with a pretty hefty load. Have you considered loading the read-only lookup tables entirely into memory? That way validation doesn't hit the database for every row. Don't forget about batching inserts whenever possible! Also, wrapping groups of operations in a transaction can cut down on commit overhead and speed up writes.
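A minimal sketch of both ideas together, assuming a made-up `categories` lookup table keyed by a `code` column and an `order_lines` target table, with `$rows` being one chunk already read from the spreadsheet:

```php
use Illuminate\Support\Facades\DB;

// Load the read-only lookup once, keyed by its natural identifier: code => id.
$categoryIds = DB::table('categories')->pluck('id', 'code');

DB::transaction(function () use ($rows, $categoryIds) {
    $batch = [];
    foreach ($rows as $row) {
        $batch[] = [
            'category_id' => $categoryIds[$row['category_code']] ?? null,
            'sku'         => $row['sku'],
            'qty'         => $row['qty'],
        ];
    }

    // One multi-row INSERT per chunk instead of one query per row.
    DB::table('order_lines')->insert($batch);
});
```

The in-memory map turns what would be a lookup query per row into a plain array access, and the single `insert()` call sends the whole chunk in one statement.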
Have you tried converting the Excel spreadsheets to CSV before importing? Parsing CSV is much cheaper for the application than parsing XLSX, and MySQL can ingest CSV directly with LOAD DATA INFILE, which skips most of the per-row overhead.
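If you go that route, a hedged sketch of loading the CSV straight into a staging table from Laravel (the `order_lines_staging` table and column list are hypothetical; this also assumes `local_infile` is enabled on the server and `PDO::MYSQL_ATTR_LOCAL_INFILE => true` is set in the connection's PDO options):

```php
use Illuminate\Support\Facades\DB;

$path = storage_path('app/import/orders.csv');
$pdo  = DB::connection()->getPdo();

// Bulk-load the raw CSV into a staging table; validate and merge from there.
$pdo->exec(
    "LOAD DATA LOCAL INFILE " . $pdo->quote($path) . "
     INTO TABLE order_lines_staging
     FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
     LINES TERMINATED BY '\\n'
     IGNORE 1 LINES
     (sku, qty, category_code)"
);
```

Loading into a staging table first keeps the fast path dumb, and the validations and upserts into the real tables can then run as set-based SQL instead of row by row in PHP.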
Have you profiled your code to see where the time is going? Database-side tools (the slow query log, EXPLAIN on the statements the import runs) can be super helpful as well. It sounds like there's a bottleneck somewhere that needs to be found before you go down the batching and parallelism route.
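One cheap way to start on the Laravel side is to log slow queries with `DB::listen` while the import runs; a sketch (the 100 ms threshold is just an arbitrary starting point):

```php
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Log;

// Log any statement slower than ~100 ms so you can tell whether the time
// goes into reading the spreadsheet, validating, or waiting on MySQL.
DB::listen(function ($query) {
    if ($query->time > 100) {
        Log::warning('Slow query during import', [
            'sql'      => $query->sql,
            'bindings' => $query->bindings,
            'ms'       => $query->time,
        ]);
    }
});
```

If nothing slow shows up in the log but the import still crawls, the bottleneck is probably in the PHP side (spreadsheet parsing, validation) rather than the database.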
Your performance issues sound like they could stem from committing too often. Frequent commits can be really costly in terms of performance. Batches of 250 seem a bit small for the volume you're dealing with; try increasing them. Also, make sure you're not holding locks longer than necessary at the database level. Have you thought about parallelizing the import across multiple worker processes, each with its own connection? Something like 50 connections might help speed things up, as long as the server can handle it!
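As a sketch of the commit-less-often idea, with made-up numbers and a hypothetical `$preparedRows` array that already holds the validated rows for one file: insert in chunks of 1,000 but only commit once per 10 chunks.

```php
use Illuminate\Support\Facades\DB;

$chunkSize    = 1000; // rows per INSERT statement
$chunksPerTxn = 10;   // commit once per 10,000 rows

foreach (array_chunk($preparedRows, $chunkSize * $chunksPerTxn) as $txnRows) {
    DB::transaction(function () use ($txnRows, $chunkSize) {
        foreach (array_chunk($txnRows, $chunkSize) as $batch) {
            DB::table('order_lines')->insert($batch);
        }
    });
}
```

The numbers are worth tuning against your own hardware: bigger transactions mean fewer fsyncs, but they also hold locks and undo space for longer, so there's a sweet spot somewhere in between.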
It's definitely a trade-off when it comes to database performance during large insertions. Just be careful how you manage indexes and constraints; that kind of optimization can really make a difference!
I haven't loaded everything into memory yet because I'm worried about RAM usage. But I always batch inserts when I can, though I hit a wall because I need the generated foreign keys from one table to insert into another.
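One pattern that can keep the inserts batched despite that foreign-key dependency: insert the parent rows in bulk, then select their generated ids back by a natural key and use that map when building the child rows. A rough sketch with hypothetical `orders`/`order_items` tables and a unique `reference` column on the parent:

```php
use Illuminate\Support\Facades\DB;

DB::transaction(function () use ($parentRows, $childRows) {
    // 1) Bulk-insert the parents; each row carries a unique 'reference'.
    DB::table('orders')->insert($parentRows);

    // 2) Fetch reference => id for just the references in this chunk.
    $orderIds = DB::table('orders')
        ->whereIn('reference', array_column($parentRows, 'reference'))
        ->pluck('id', 'reference');

    // 3) Build and bulk-insert the children with the resolved foreign keys.
    $children = [];
    foreach ($childRows as $row) {
        $children[] = [
            'order_id' => $orderIds[$row['order_reference']],
            'sku'      => $row['sku'],
            'qty'      => $row['qty'],
        ];
    }

    DB::table('order_items')->insert($children);
});
```

That's two extra queries per chunk (the lookup and the child insert) instead of an `insertGetId` round trip per parent row, so the whole thing stays batched.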