I'm running a Node.js SaaS that has seen unexpectedly high data processing demands, and it's starting to affect my infrastructure. Currently, I'm hosted on a single EC2 instance, where I have two main containers: one for the API (duplicated and behind a load balancer) and another dedicated to handling background tasks. The issue I'm facing is that critical tasks are getting delayed, and my worker container is restarting 6-7 times per day due to memory spikes.
The workers are responsible for tasks like making API calls to external services, heavy data processing and parsing, document generation, and running analysis tasks on large datasets. Some of these jobs are time-sensitive, while others can take hours to complete.
I'm considering a couple of options, such as using managed Redis (AWS ElastiCache) and switching to SQS for job management. What would be the best approach to scale my workers based on this workload, and why?
5 Answers
I'd recommend putting your API and workers on separate instances, so a memory spike in the workers can't take the API down with it. Since both are already containerized, a managed AWS service like ECS or Elastic Beanstalk can run them independently and keep an issue in one part from causing complete application failure. For your workers, a job queue such as SQS or BullMQ can also help manage tasks better and possibly alleviate those memory issues.
You might want to think about using AWS Batch with Fargate or EC2 Spot instances for your background tasks. For your API, it sounds like you already have an Elastic Load Balancer, but do you not have an Auto Scaling Group set up? That could allow you to automatically add more instances when needed.
Before making major architectural changes, I'd suggest tackling the memory spikes first. There could be bugs in your application causing these issues. Just shifting your infrastructure might not solve the root problem.
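One way to start narrowing this down is to record heap usage before and after each job and see which job types account for the growth. This is only a rough sketch: `HeapSample` and `heapGrowthByJobType` are made-up names, and it assumes you collect the numbers yourself, for example by reading `process.memoryUsage().heapUsed` around each job in the worker. If jobs run concurrently the deltas will be noisy, so treat the result as a hint rather than proof.

```typescript
// Hypothetical shape for one measurement taken around a single job.
interface HeapSample {
  jobType: string;
  beforeBytes: number; // heap used just before the job started
  afterBytes: number;  // heap used just after the job finished
}

// Sum the heap growth per job type so the worst offenders stand out.
function heapGrowthByJobType(samples: HeapSample[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const sample of samples) {
    const delta = sample.afterBytes - sample.beforeBytes;
    totals.set(sample.jobType, (totals.get(sample.jobType) ?? 0) + delta);
  }
  return totals;
}
```

A job type whose total keeps climbing across many runs is a reasonable first candidate for a leak or for loading too much data into memory at once.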
Why did you choose to host everything on a single EC2 instance? There might be underlying application-level issues causing the memory spikes. Are your API calls asynchronous? Do you have proper timeouts and backoff strategies? All of these factors can strain your CPU and memory. If you're handling critical tasks, switching to ECS Fargate could be a wiser choice; a single EC2 instance may not offer the resilience you need.
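To make the timeout and backoff point concrete, here is a minimal sketch of wrapping an external call with a per-attempt timeout and exponential backoff. The helper names (`backoffDelayMs`, `withTimeout`, `retryWithBackoff`) and the default values are assumptions for illustration, not part of any library.

```typescript
// Delay before the next retry: baseMs, 2*baseMs, 4*baseMs, ... capped at capMs.
// `attempt` is 0-based.
function backoffDelayMs(attempt: number, baseMs = 200, capMs = 10000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Reject if `work` has not settled within timeoutMs.
// Note: this stops waiting but does not cancel the underlying work.
async function withTimeout<T>(work: () => Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined = undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${timeoutMs} ms`)), timeoutMs);
  });
  try {
    return await Promise.race([work(), timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Try `work` up to maxAttempts times, sleeping with exponential backoff between failures.
async function retryWithBackoff<T>(
  work: () => Promise<T>,
  maxAttempts = 4,
  timeoutMs = 5000,
  baseMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await withTimeout(work, timeoutMs);
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        const delay = backoffDelayMs(attempt, baseMs);
        await new Promise<void>((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```

Without a cap on attempts and a timeout per call, a slow external service can leave many requests pending at once, each holding its buffers in memory.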
It's a bit unclear what you're specifically looking for. Are you in need of more computational resources, like a bigger instance or serverless options? Or do you need a better multiprocessing setup to ensure that quick tasks get done on time, perhaps with a separate queue?
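If it's the second case, the separate-queue idea could look roughly like this: time-sensitive jobs always go first, and long-running jobs are limited to a fixed number of slots so they can never occupy every worker. This is an in-memory sketch with hypothetical names (`TwoLaneScheduler`, `urgent`, `bulk`); in practice the two lanes would more likely be two SQS or BullMQ queues with separate consumers.

```typescript
type Lane = "urgent" | "bulk";

interface Job {
  id: string;
  lane: Lane;
}

class TwoLaneScheduler {
  private queues: Record<Lane, Job[]> = { urgent: [], bulk: [] };
  private running = 0;
  private runningBulk = 0;
  private totalSlots: number;
  private maxBulkSlots: number;

  constructor(totalSlots: number, maxBulkSlots: number) {
    this.totalSlots = totalSlots;
    this.maxBulkSlots = maxBulkSlots;
  }

  enqueue(job: Job): void {
    this.queues[job.lane].push(job);
  }

  // Returns the next job to start, or undefined if nothing may start right now.
  next(): Job | undefined {
    if (this.running >= this.totalSlots) return undefined;
    // Urgent jobs always win when a slot is free.
    const urgent = this.queues.urgent.shift();
    if (urgent) {
      this.running++;
      return urgent;
    }
    // Bulk jobs only start while they hold fewer than maxBulkSlots slots.
    if (this.runningBulk < this.maxBulkSlots) {
      const bulk = this.queues.bulk.shift();
      if (bulk) {
        this.running++;
        this.runningBulk++;
        return bulk;
      }
    }
    return undefined;
  }

  // Call when a job completes so its slot is released.
  finish(job: Job): void {
    this.running--;
    if (job.lane === "bulk") this.runningBulk--;
  }
}
```

With three slots and at most one bulk slot, for example, an hours-long analysis job can hold one worker while two stay available for the time-sensitive work.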
