Hey everyone! I'm using AWS Lambda to process NBA player performance data (points, rebounds, assists) from S3 and insert it into Supabase (which is Postgres under the hood). Each Lambda run uploads data in six segments: Last 3 games, Last 5 games, Last 10 games, Last 30 games, This Season, and Last Season. Each file is around 480 MB and holds up to 3 million rows.
Here's how I've set it up:
- I'm using a 10 GB Lambda with a 10k row batch size and 8 workers.
- There's a 15-minute timeout on the function.
To keep things tidy, I've implemented sharded deletes based on player_name prefixes that wipe out the old data row by row before the new data goes in. I'm still hitting HTTP 500 errors and timeouts during some DELETE operations, though, which makes the insert step unpredictable.
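Simplified, the delete pass looks something like this (a rough sketch; the table and column names are stand-ins for my real schema, and the actual code batches a bit more):

```python
import psycopg2  # direct connection to the Supabase Postgres instance

# Shard the delete by player_name prefix so no single statement touches every row.
PREFIXES = [chr(c) for c in range(ord("a"), ord("z") + 1)]

def wipe_old_rows(conn, data_window: str) -> None:
    """Delete last run's rows for one data window (e.g. 'last_10'), one prefix shard at a time."""
    with conn.cursor() as cur:
        for prefix in PREFIXES:
            cur.execute(
                "DELETE FROM player_stats WHERE data_window = %s AND player_name ILIKE %s",
                (data_window, prefix + "%"),
            )
            conn.commit()  # commit per shard so a slow shard doesn't hold one giant transaction
```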
I have a few questions for anyone experienced with this kind of setup:
1. How can I improve bulk deletes in Supabase/Postgres? Should I consider partitioning my tables, truncating partitions instead of deleting rows, or scheduling cleanup jobs? (I've put a rough sketch of the partition/TRUNCATE idea right after this list.)
2. Would it be more efficient to use UPSERT or merge operations instead of full data wipes?
3. Is there a benefit to splitting this into multiple smaller Lambdas for each data window rather than managing it all in one function?
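To make question 1 concrete, this is roughly the partition-plus-TRUNCATE shape I'm imagining (untested sketch; the table layout is simplified, only one partition is shown, and the names are placeholders):

```python
import psycopg2

# If player_stats were LIST-partitioned by data window, refreshing one window would be
# a cheap TRUNCATE of that partition instead of millions of row deletes.
# Only the 'last_10' partition is shown; I'd create one per window.
SETUP_SQL = """
CREATE TABLE IF NOT EXISTS player_stats (
    player_name text NOT NULL,
    data_window text NOT NULL,
    points      numeric,
    rebounds    numeric,
    assists     numeric
) PARTITION BY LIST (data_window);

CREATE TABLE IF NOT EXISTS player_stats_last_10
    PARTITION OF player_stats FOR VALUES IN ('last_10');
"""

def ensure_schema(conn) -> None:
    """Create the partitioned parent table and its per-window partitions."""
    with conn.cursor() as cur:
        cur.execute(SETUP_SQL)
    conn.commit()

def refresh_window(conn, window: str, rows: list[tuple]) -> None:
    """Swap out one data window: truncate its partition, then insert the fresh rows."""
    with conn.cursor() as cur:
        # window comes from my fixed list of six segments, so the f-string is safe here
        cur.execute(f"TRUNCATE TABLE player_stats_{window}")
        cur.executemany(
            "INSERT INTO player_stats (player_name, data_window, points, rebounds, assists) "
            "VALUES (%s, %s, %s, %s, %s)",
            rows,
        )
    conn.commit()
```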
I'm really looking for advice from anyone who's pushed large datasets into Supabase/Postgres at scale. Any tips or experiences you can share would be greatly appreciated!
3 Answers
Have you considered setting up AWS Glue ETL jobs? That could streamline your data processing. AWS Batch with Fargate might also work well. Both options handle large data volumes more gracefully than Lambda for a workload like yours.
Honestly, Lambda might not be the best fit for this kind of task. It's generally better to run a long-lived service that keeps database connections open and manages a connection pool. You should also look into optimizing the database itself: check for missing indexes and revisit your table design. Ideally, your application performs bulk inserts and every query it runs is covered by the right index.
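For example, a long-lived worker might look something like this (just a sketch, assuming psycopg2 and a player_stats table with the columns you mentioned; swap in your real DSN and schema):

```python
from psycopg2.pool import SimpleConnectionPool
from psycopg2.extras import execute_values

# One pool for the lifetime of the service, instead of reconnecting on every batch.
pool = SimpleConnectionPool(
    minconn=1,
    maxconn=8,
    dsn="postgresql://user:password@db.<project>.supabase.co:5432/postgres",  # placeholder DSN
)

def bulk_insert(rows: list[tuple]) -> None:
    """Insert many rows with one round trip per 10k-row page instead of row-at-a-time INSERTs."""
    conn = pool.getconn()
    try:
        with conn.cursor() as cur:
            execute_values(
                cur,
                "INSERT INTO player_stats (player_name, data_window, points, rebounds, assists) VALUES %s",
                rows,
                page_size=10_000,
            )
        conn.commit()
    finally:
        pool.putconn(conn)
```

COPY would likely be even faster for files this size, but multi-row inserts like this already get you out of per-row INSERT territory.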
It sounds like you're trying to do too much all at once with your current setup. I’d suggest using AWS Glue along with Athena to do as much of the aggregating as you can before writing to Supabase. Once you have the results, drop them into another bucket and then queue them for batch writing to Supabase. Using a dead letter queue for any failed attempts would also help keep things organized and provide an audit trail.
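The writer side could stay pretty small, something like this (a sketch, assuming an SQS-triggered Lambda with a dead-letter queue configured via the queue's redrive policy; the bucket, key, and table names are made up):

```python
import json

import boto3
import psycopg2

s3 = boto3.client("s3")

def handler(event, context):
    """SQS-triggered writer: each message points at an aggregated results file in S3.
    Raising on failure lets SQS retry the message and, after maxReceiveCount,
    park it on the dead-letter queue as an audit trail of what never landed."""
    conn = psycopg2.connect(
        "postgresql://user:password@db.<project>.supabase.co:5432/postgres"  # placeholder DSN
    )
    try:
        for record in event["Records"]:
            body = json.loads(record["body"])  # e.g. {"bucket": "agg-results", "key": "last_10/part-0.json"}
            obj = s3.get_object(Bucket=body["bucket"], Key=body["key"])
            rows = json.loads(obj["Body"].read())  # assumes the Glue/Athena output is a JSON array of rows
            with conn.cursor() as cur:
                cur.executemany(
                    "INSERT INTO player_stats (player_name, data_window, points, rebounds, assists) "
                    "VALUES (%s, %s, %s, %s, %s)",
                    [tuple(r) for r in rows],
                )
            conn.commit()
    finally:
        conn.close()
```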
Great idea! Managing failures properly will definitely make things smoother.
That makes sense! Keeping a persistent connection would definitely help with efficiency. Have you had any success with that kind of setup?