I'm looking for effective strategies and tools for managing a large number of AWS accounts, specifically from hundreds to thousands. I'd love to hear about your experiences and get pointers to any knowledge bases, whitepapers, or YouTube videos that cover this topic. Here are some specific points where I'm seeking advice:
1. What tooling and approval processes do you use for both core infrastructure (like accounts, networking, permissions, auditing, and policy enforcement) and for development workloads (like EKS/ECS, tasks, and load balancers)?
2. Do you enforce the same tooling and approval process for core infrastructure and development teams, or do you allow the dev teams to choose their own tools and procedures for managing workloads?
3. Do your developers run Terraform or other tools from their local machines, or do you utilize GitOps or DevOps tools like Spacelift, Firefly, or Terraform Cloud?
4. How do you organize your Git repositories? Are they structured by account or environment? And how do you ensure that the code used for pre-production is identical to what's deployed in production?
I realize this is a broad question, but I'm seeking real-world experiences that can guide me as I look to scale in my own environment.
5 Answers
If you're managing thousands of accounts, it might be worth considering enterprise support from AWS. They can provide a dedicated team to help you navigate this process, though it does come with a cost.
For our team, we have a clear split in tooling:
- Core infrastructure is handled with Terraform alongside Atlantis by the platform team only.
- Dev teams decide their tools for workloads (like EKS or services).
Developers have a landing zone with pre-defined VPCs and roles, and they can't modify the core infrastructure. All changes go through CI/CD pipelines using Atlantis for Terraform to maintain an audit trail. For scaling, options like Spacelift or Terraform Cloud work better than Atlantis for 1000+ accounts.
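To illustrate the "developers can't modify the core infrastructure" part, here is a minimal sketch of a service control policy (SCP) built in Python. It is not our actual policy: the role name `platform-automation` and the chosen list of EC2 actions are assumptions for illustration, and a real guardrail set would likely cover more services.

```python
import json

# Hypothetical name for the platform team's automation role (an assumption,
# not something taken from this thread).
PLATFORM_ROLE_PATTERN = "arn:aws:iam::*:role/platform-automation"


def build_network_guardrail_scp(platform_role_pattern: str = PLATFORM_ROLE_PATTERN) -> dict:
    """Return an SCP document that denies core VPC changes to every
    principal except those matching the platform role pattern."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyCoreNetworkChanges",
                "Effect": "Deny",
                # Illustrative subset of networking actions; extend as needed.
                "Action": [
                    "ec2:CreateVpc",
                    "ec2:DeleteVpc",
                    "ec2:CreateSubnet",
                    "ec2:DeleteSubnet",
                    "ec2:CreateRoute",
                    "ec2:DeleteRoute",
                ],
                "Resource": "*",
                # The deny applies only when the caller is NOT the platform role.
                "Condition": {
                    "ArnNotLike": {"aws:PrincipalArn": platform_role_pattern}
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_network_guardrail_scp(), indent=2))
```

The resulting JSON could be attached to the workload OUs through whatever tool manages your organization (Terraform in our case), so the guardrail itself goes through the same reviewed pipeline.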
As for Git structure, we keep core and workload repositories separate. Production and pre-production use the same modules, and we promote exact Git SHAs from one environment to the next so that what runs in production is the code that was validated in pre-production.
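A rough sketch of how the SHA promotion rule could be checked in CI, assuming each environment keeps a simple mapping of module name to pinned commit. The function name, the pin format, and the module names are hypothetical; the idea is only that production may pin nothing that pre-production is not already running.

```python
import re

# A full 40-character commit SHA; branch names and tags are rejected on purpose.
FULL_SHA = re.compile(r"[0-9a-f]{40}")


def find_unpromoted_pins(preprod: dict[str, str], prod: dict[str, str]) -> list[str]:
    """Return a list of problems with the production pins.

    A production pin is accepted only if it is a full commit SHA and
    pre-production pins the same module to exactly the same SHA.
    """
    problems = []
    for module, sha in sorted(prod.items()):
        if not FULL_SHA.fullmatch(sha):
            problems.append(f"{module}: '{sha}' is not a full commit SHA")
        elif preprod.get(module) != sha:
            problems.append(f"{module}: production pin does not match pre-production")
    return problems


if __name__ == "__main__":
    validated = "a" * 40  # placeholder SHA for illustration
    preprod_pins = {"vpc": validated, "iam-baseline": validated}
    prod_pins = {"vpc": validated, "iam-baseline": "main"}
    for problem in find_unpromoted_pins(preprod_pins, prod_pins):
        print(problem)
```

Failing the pipeline whenever this list is non-empty keeps floating refs such as `main` out of production and makes every promotion an explicit, reviewable change.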
I recommend looking into the Landing Zone Accelerator. It’s specifically designed to help manage AWS accounts effectively, and you can find more details on AWS's official site.
Tags and Control Tower can help keep everything organized. You should also review how root credentials are handled across accounts: the process should be standardized, but each root user needs its own unique password and MFA rather than a shared one. Clear, written policies matter a lot at this scale.
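As a small example of what a tagging policy check might look like, the sketch below reports which accounts are missing required tags. The required tag keys and the account record shape are assumptions for illustration, not a standard.

```python
# Hypothetical tagging policy; pick whatever keys your organization requires.
REQUIRED_TAGS = {"owner", "environment", "cost-center"}


def missing_tags(accounts: list[dict]) -> dict[str, list[str]]:
    """Map account ID to the sorted list of required tag keys it lacks.

    Each account is a dict like {"id": "...", "tags": {"owner": "..."}}.
    Tag keys are compared exactly, since AWS tag keys are case-sensitive.
    """
    report = {}
    for account in accounts:
        present = set(account.get("tags", {}))
        missing = sorted(REQUIRED_TAGS - present)
        if missing:
            report[account["id"]] = missing
    return report


if __name__ == "__main__":
    sample = [
        {"id": "111111111111", "tags": {"owner": "platform", "environment": "prod", "cost-center": "42"}},
        {"id": "222222222222", "tags": {"owner": "team-a"}},
    ]
    print(missing_tags(sample))
```

In practice the input could be assembled from the AWS Organizations API (for example boto3's `list_accounts` and `list_tags_for_resource`), and the report run on a schedule so that untagged accounts are caught early.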
I disagree with the idea of putting unrelated workloads in one account; that can increase security risks. Having separate accounts helps mitigate potential issues.
I've worked with AWS for a long time, and I can say that a multi-account setup on its own, without proper automation and guardrails, tends to add complexity rather than remove it.
We focus heavily on strategy, architecture, and design before jumping into implementation. Setting a solid foundation for account management, network design and placement, and security standards is key. Plan for compliance, observability, and automation early as well, since they play major roles as you scale.

That sounds like a solid structure! I've seen teams struggle with managing changes without proper versioning, so promoting SHAs is a smart approach.