Hi everyone! We're currently working on an architecture for a public-facing FinTech application that utilizes multiple microservices and is hosted on AWS. We'd love to hear from anyone who has experience building or managing similar systems at scale.
Here are a few specific areas where I'd appreciate your insights:
1. **EKS Cluster Strategy**: Should we deploy all services within a single EKS cluster using namespaces and other security measures, or is it better to have multiple clusters for improved isolation and reduced risk? What's the typical approach for FinTech or regulated workloads?
2. **EKS Auto Mode vs. Self-Managed**: Considering we'll have high and unpredictable traffic with strong security requirements, should we opt for EKS Auto Mode and managed node groups, or would self-managed worker nodes offer better control over AMIs and compliance? What has worked in your real-world setups?
3. **Observability & Data Security**: We need efficient APM, logging, and alerting solutions, but we're concerned about handling sensitive data. Is it safe to use tools like Datadog and New Relic, or is self-hosting better? How do teams generally manage PII masking and compliance?
4. **Security Best Practices**: Any lessons or recommendations regarding network isolation, secrets management, pod-level security, or zero-trust models would be greatly appreciated.
If you have implemented a similar setup on EKS, I'd love to hear about your architecture, any trade-offs you faced, and what you would change if you could. Thanks in advance!
2 Answers
I'd skip managed node groups and use Karpenter for more flexible node provisioning. Start with one cluster per environment (dev, staging, and production). For monitoring, the kube-prometheus stack with Loki could work well, and self-hosting is often significantly cheaper than using external providers. For secrets management, consider AWS Secrets Manager or HashiCorp Vault, and sync the secrets into the cluster with the External Secrets Operator. Service meshes can be complex and are not always the best fit for newcomers. Worker nodes can run in public or private subnets, depending on your requirements.
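On the PII masking question from the original post: one common pattern is to redact sensitive values inside the application, before log lines reach any backend, whether that is Loki or a hosted vendor. Below is a minimal Python sketch, assuming the standard `logging` module. The regexes, the placeholder strings, and the names `mask_pii` and `PiiMaskingFilter` are hypothetical and chosen only for illustration; real rules depend on your data and compliance scope.

```python
import logging
import re

# Hypothetical patterns for illustration; tune them to your own data.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def mask_pii(text: str) -> str:
    """Replace email addresses and card-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return CARD_RE.sub("[CARD]", text)


class PiiMaskingFilter(logging.Filter):
    """Redact the fully formatted message before any handler emits it."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Merge args into the message first so values passed as
        # parameters are masked too, then drop the args.
        record.msg = mask_pii(record.getMessage())
        record.args = None
        return True  # keep the record, only its content changes


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.addFilter(PiiMaskingFilter())
    logger = logging.getLogger("payments")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("charge failed for %s", "jane@example.com")
```

Regex masking like this is best-effort. It tends to work as a safety net alongside not logging sensitive fields in the first place, not as a replacement for that.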
Before diving into architecture, it's essential to clarify your business requirements. What are your priorities for availability and compliance?
Our app needs to be highly available, with business and security compliance firmly in place, including audits.

Could you explain your "no node groups" approach further? I’m concerned about the different hardware needs of our various apps. Wouldn’t two autoscaling node groups, one per resource profile, be a better choice?