I'm working on a Monte Carlo simulation script that takes about an hour to run and produces slightly different results each time. I want to run this script roughly 100 times in parallel on a powerful AWS instance. I've checked out AWS Batch and SageMaker, but I'm unsure how to set everything up for this task. What's the easiest way to run these jobs in parallel?
7 Answers
Does your script use up an entire instance? Also, do you need specific inputs for each job? If you can confirm the runs are independent, you might want to try the EC2 RunInstances API with user data on a custom AMI. Not the prettiest solution, but it's simple and gets the job done.
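A rough sketch of that RunInstances approach with boto3, assuming the AMI already contains the script and the AWS CLI. The script path, the `--seed`/`--out` flags, the bucket, and the helper names are all made up for illustration; the instance would also need an instance profile allowing the S3 upload, and your account's vCPU quota has to cover 100 instances.

```python
def build_run_instances_request(ami_id, instance_type, run_index, results_bucket):
    """Build keyword arguments for one EC2 RunInstances call (one simulation run)."""
    # User data runs once at first boot. Every path, flag and bucket name
    # below is a placeholder, not something from the original question.
    user_data = "\n".join([
        "#!/bin/bash",
        f"python3 /opt/sim/simulate.py --seed {run_index} --out /tmp/result_{run_index}.json",
        f"aws s3 cp /tmp/result_{run_index}.json s3://{results_bucket}/result_{run_index}.json",
        "shutdown -h now",
    ])
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        # boto3 base64-encodes UserData for run_instances
        "UserData": user_data,
        # Terminate (rather than stop) when the script shuts the machine down
        "InstanceInitiatedShutdownBehavior": "terminate",
    }


def launch_all(n_runs, ami_id, instance_type, results_bucket):
    """Launch one instance per run; needs AWS credentials and boto3 installed."""
    import boto3  # imported here so the request builder works without AWS access

    ec2 = boto3.client("ec2")
    for i in range(n_runs):
        ec2.run_instances(
            **build_run_instances_request(ami_id, instance_type, i, results_bucket)
        )
```

Each instance runs one simulation, uploads its result, and terminates itself, so nothing keeps billing after the run finishes.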
You're right; Batch is a good choice for this, although it's not very intuitive. You'll need to set up a compute environment and job queue and containerize your job, but in return you get a framework for managing the tasks, even if it does involve a bit of setup.
Check out Coiled or Dask; they work really well with AWS and offer an easier interface for running parallel jobs. You can find more info in their documentation to get started.
The most straightforward method is to use AWS Batch, since it suits long-running tasks. You just need to submit the jobs and pass parameters to each one. It's pretty effective for this type of workload.
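For 100 identical runs, a Batch array job may be the least effort: one `submit_job` call with `arrayProperties`, and each child job reads its index from the `AWS_BATCH_JOB_ARRAY_INDEX` environment variable. Here is a minimal sketch, assuming the job queue and job definition already exist; the helper names and the seed scheme are my own, not anything from the question.

```python
import os


def build_array_job_request(job_name, job_queue, job_definition, n_runs):
    """Keyword arguments for a single Batch SubmitJob call that fans out to n_runs children."""
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        # Batch starts n_runs child jobs, indexed 0 .. n_runs - 1
        "arrayProperties": {"size": n_runs},
    }


def seed_for_this_run(base_seed=0, environ=None):
    """Inside the container: derive a distinct seed from the array index."""
    environ = os.environ if environ is None else environ
    index = int(environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
    return base_seed + index


def submit(job_name, job_queue, job_definition, n_runs=100):
    """Submit the array job; needs AWS credentials and boto3 installed."""
    import boto3  # imported here so the helpers above work without AWS access

    batch = boto3.client("batch")
    response = batch.submit_job(
        **build_array_job_request(job_name, job_queue, job_definition, n_runs)
    )
    return response["jobId"]
```

Calling `seed_for_this_run()` at the top of the simulation script gives every child job its own seed, which also makes any single run reproducible later.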
The best solution depends on your script's complexity and resource demands. It sounds like putting your script into a Docker container could unlock more options for running it across various environments. AWS SageMaker provides a range of distributed compute options, but if you don't need all that, you might find it tedious to set up.
For Docker solutions, AWS Batch is solid for scheduling long-running tasks, though it could be overkill for your needs. Alternatively, using ECS to run tasks might be ideal if you're containerized.
Using AWS Glue could also be beneficial; it can drastically reduce your simulation time if you know how to scale it properly.
A classic approach is to use SQS to queue up your jobs and set up EC2 instances in an Auto Scaling Group (ASG) to handle the workload. If your jobs can be re-triggered on failure, consider using spot instances to save costs. Since this is a one-time need, you may need to adapt the method a little, or use Lambda for better control over how many instances you scale up.
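The queue side of that pattern could look roughly like this with boto3. It is only a sketch: the message format (`{"seed": i}`), the function names, and the `run_simulation` callback are assumptions for illustration. Note that the queue's visibility timeout has to be longer than one run (the default is 30 seconds, so an hour-long job needs it raised), otherwise messages reappear while they are still being processed.

```python
import json


def build_batches(n_runs, batch_size=10):
    """Group one message per run into entry lists for SendMessageBatch (max 10 per call)."""
    entries = [
        {"Id": str(i), "MessageBody": json.dumps({"seed": i})}
        for i in range(n_runs)
    ]
    return [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]


def enqueue_runs(queue_url, n_runs=100):
    """Producer: put one message per simulation run on the queue."""
    import boto3  # imported here so build_batches works without AWS access

    sqs = boto3.client("sqs")
    for entries in build_batches(n_runs):
        sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)


def worker_loop(queue_url, run_simulation):
    """Consumer: runs on each EC2 instance until the queue looks empty."""
    import boto3

    sqs = boto3.client("sqs")
    while True:
        response = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20
        )
        messages = response.get("Messages", [])
        if not messages:
            break  # nothing received within the long-poll window
        message = messages[0]
        run_simulation(json.loads(message["Body"])["seed"])
        # Delete only after success, so an interrupted (e.g. spot) run is retried
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
```

Deleting the message only after the run succeeds is what makes spot instances workable here: if an instance is reclaimed mid-run, the message becomes visible again and another worker picks it up.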

I was hoping for something a bit more hands-off, but I'll give this a try. Cheers!