I'm a new developer tasked with scaling one component of our backend so it can handle multiple long-running tasks at the same time. These 'runs' take a few hours each, depending on the data, and involve many third-party API calls, database interactions, and data processing. We currently have a basic AWS setup with a few EC2 instances, but we want to run more than one task at a time without being bottlenecked by our resources.
I've designed a potential solution where an API hosted on EC2 enqueues tasks into an SQS queue. A Lambda function checks the number of active tasks and starts new ones on ECS Fargate when there is capacity, while each running task sends heartbeats that extend its message's visibility timeout in SQS so the message is not redelivered mid-run. Redis would handle rate limiting, and I've incorporated CloudWatch and Sentry for observability. I'm looking for expert feedback on this design and any important considerations I might be missing for horizontal scaling.
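For reference, here is a minimal sketch of the heartbeat part of the design. It assumes `sqs` is a boto3 SQS client (or any object with the same `change_message_visibility` method); the function name `start_heartbeat` and the interval and timeout defaults are illustrative, not something we have settled on.

```python
import threading


def start_heartbeat(sqs, queue_url, receipt_handle, interval_s=60, timeout_s=180):
    """Keep an in-flight SQS message hidden while a long run is in progress.

    `sqs` is assumed to be a boto3 SQS client. Returns a threading.Event;
    call .set() on it when the run finishes to stop the heartbeat.
    """
    stop = threading.Event()

    def beat():
        # Event.wait returns False on timeout and True once stop is set,
        # so the loop exits promptly when the run completes.
        while not stop.wait(interval_s):
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=timeout_s,  # seconds from now
            )

    # Daemon thread: it must not keep the container alive after the run ends.
    threading.Thread(target=beat, daemon=True).start()
    return stop
```

The timeout should be comfortably larger than the interval so that one missed heartbeat does not cause a redelivery. SQS also caps the total visibility timeout of a message at 12 hours from the time it was received, which a run of a few hours fits within.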
3 Answers
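Scaling and ensuring reliability is tricky, especially when you're new to it. If you need hands-on help, that's what consultants are for! But seriously, make sure you understand how retries work in your SQS setup: the visibility timeout, the receive count, and what happens to a message that keeps failing (ideally it lands in a dead-letter queue). This could save you a ton of headaches with transient failures down the line.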
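To make the retry point concrete, here is a small sketch of the queue attribute that controls SQS retries. `RedrivePolicy`, `deadLetterTargetArn`, and `maxReceiveCount` are the real SQS attribute names; the helper name `redrive_attributes`, the ARN, and the default of 3 receives are placeholders I made up.

```python
import json


def redrive_attributes(dlq_arn, max_receive_count=3):
    """Build the Attributes dict for a queue with a dead-letter queue.

    After a message has been received `max_receive_count` times without
    being deleted, SQS moves it to the queue identified by `dlq_arn`.
    """
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    }
    # SQS expects the policy as a JSON string, not a nested dict.
    return {"RedrivePolicy": json.dumps(policy)}


# Hypothetical usage with a boto3 SQS client named `sqs`:
# sqs.set_queue_attributes(
#     QueueUrl=queue_url,
#     Attributes=redrive_attributes("arn:aws:sqs:us-east-1:123456789012:runs-dlq"),
# )
```

With runs that take hours, each failed attempt is expensive, so a low receive count plus an alarm on the dead-letter queue is probably a better fit than many silent retries.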
Your plan follows a solid pattern for horizontal scaling, especially the use of queues and containerized services. A producer-consumer model is a great first step. However, I think bringing in Redis solely for rate limiting may be unnecessary unless you're already using it for caching. Consider enforcing your rate limits at the API level, before work reaches the ECS tasks; this could simplify things significantly.
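As a rough illustration of rate limiting at the API level without Redis, here is a minimal in-process token bucket. The class name `TokenBucket` and its parameters are my own; note the assumption that a single API process makes the admission decision, because the state lives in memory and is not shared across processes.

```python
import time


class TokenBucket:
    """In-memory token bucket: `rate_per_s` tokens refill per second,
    up to `capacity`. Not shared across processes (that is the trade-off
    compared with a Redis-backed limiter)."""

    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if enough are available."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Hypothetical usage in the enqueue endpoint:
# limiter = TokenBucket(rate_per_s=0.5, capacity=5)
# if not limiter.allow():
#     ...reject or delay the request instead of enqueueing it...
```

If the limit you care about is on the third-party APIs that many Fargate tasks call concurrently, a shared store is still needed, so this only replaces Redis when the limit can be applied at admission time.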
Don't forget the importance of monitoring and adjusting your infrastructure! Since you're working with ephemeral Fargate tasks rather than long-lived instances, it can be hard to tell when they are overwhelmed. Drive scaling from actual usage metrics such as queue depth or CPU utilization, using auto scaling rather than a load balancer (a load balancer only distributes traffic; it does not add capacity). That lets capacity be added or removed without interrupting runs that are already in progress.
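One way to base the scaling decision on real numbers is to read the queue backlog and the count of running tasks, then compute how many tasks to launch. The sketch below assumes `sqs` and `ecs` are boto3 clients; `get_queue_attributes` and `list_tasks` are real boto3 calls, while the function names and the `max_concurrent` cap are hypothetical.

```python
def tasks_to_start(queued, running, max_concurrent):
    """How many new tasks to launch: never more than the backlog,
    never more than the remaining headroom, never negative."""
    return max(0, min(queued, max_concurrent - running))


def plan_launches(sqs, ecs, queue_url, cluster, max_concurrent):
    """Read current metrics and return the number of tasks to start.

    Assumes fewer than 100 running tasks, since list_tasks returns at
    most 100 ARNs per page and this sketch does not paginate.
    """
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    # SQS returns attribute values as strings.
    queued = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

    running = len(
        ecs.list_tasks(cluster=cluster, desiredStatus="RUNNING")["taskArns"]
    )
    return tasks_to_start(queued, running, max_concurrent)
```

The SQS count is approximate, so treat the result as a target to converge on over several invocations rather than an exact figure.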
