We're a team of around 100 people running on Amazon EKS, primarily on EC2 nodes with some workloads on Fargate. We integrate with AWS services such as RDS and S3. Recent security audits flagged several gaps, and leadership is asking for a solid hardening plan as we consider expanding to more namespaces.
We've followed some of the basic AWS guidelines and implemented a few OPA policies, but we still have problems such as overly broad IAM mappings in aws-auth, and our testing turned up potential pod escape risks.
The recent Change Healthcare breach raised our concerns further. As I understand it, attackers exploited a misconfigured IAM role in an EKS cluster, which let them move laterally through pods and compromise patient data. We definitely want to avoid a situation like that.
I'm looking for advice on where to prioritize our efforts, specifically practices that are proven in production for:
* IAM and RBAC configurations that work effectively (any IRSA examples?)
* Network policies combined with security groups for proper workload segmentation
* Image scanning and runtime checks that don't negatively affect performance
* Monitoring solutions that can identify drift or anomalies early on
* Node hardening and adhering to pod security standards
What checklists or strategies have you found useful?
7 Answers
It might be beneficial to engage the auditors proactively and involve them in establishing the practices they identify as concerns, rather than waiting for them to evaluate what you have already built.
Start with the basics: enforce least privilege using IRSA and RBAC, and use NetworkPolicies with security groups to segment workloads. Integrate image scanning into your CI/CD pipeline with tools like Trivy or Grype, and pair that with runtime detection using Falco and admission control using OPA policies. For node hardening, the EKS-optimized AMIs or Bottlerocket are good options, combined with thorough audit logging and anomaly detection. Treat all of these as layers of one defense strategy. Breaches often stem from overlooked basics rather than complex CVEs. The CIS Benchmark for EKS can serve as a practical checklist.
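Since the question mentions overly broad mappings in aws-auth, here is a minimal sketch of how a first-pass check could look. It assumes you have already parsed the `mapRoles` entries into a list of dicts (the ConfigMap stores them as YAML text); the function name, account ID, and role names are made up for illustration.

```python
# Hypothetical helper: flag aws-auth mapRoles entries that grant cluster-admin.
# Input is the already-parsed mapRoles list, not the raw ConfigMap text.

BROAD_GROUPS = {"system:masters"}  # built-in group bound to cluster-admin


def find_broad_mappings(map_roles):
    """Return the rolearn of every mapping that includes a broad group."""
    flagged = []
    for entry in map_roles:
        groups = set(entry.get("groups", []))
        if groups & BROAD_GROUPS:
            flagged.append(entry["rolearn"])
    return flagged


# Example data; the account ID and role names are placeholders.
example_map_roles = [
    {
        "rolearn": "arn:aws:iam::111122223333:role/eks-node-role",
        "username": "system:node:{{EC2PrivateDNSName}}",
        "groups": ["system:bootstrappers", "system:nodes"],
    },
    {
        "rolearn": "arn:aws:iam::111122223333:role/dev-team",
        "username": "dev-team",
        "groups": ["system:masters"],
    },
]

if __name__ == "__main__":
    for arn in find_broad_mappings(example_map_roles):
        print("overly broad mapping:", arn)
```

You would probably want to extend `BROAD_GROUPS` with any custom groups you have bound to wide ClusterRoles.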
I've been working on a diagram to help prioritize security measures—it's a bit opinionated, but it covers a lot of ground! You can check it out here: https://kubesec-diagram.github.io/
There are some straightforward improvements you can implement:
- Use very small container images with minimal tooling to limit lateral movement and escapes.
- Stick to pod security standards, opting for Restricted or Baseline levels, and segment your workloads into namespaces according to their required privileges.
- Avoid granting workload permissions through the node instance role; keep that role to the minimum the node itself needs. Instead, use IRSA to give IAM roles only to those workloads that truly require AWS API access, and keep each role specific to one service.
- Spend extra effort fine-tuning network policies and IAM roles, particularly in namespaces that manage development workloads. Being clear on who has access and how they're utilizing the pods is crucial.
Pod escapes are definitely a concern, but honestly, I've seen many clusters struggle with RBAC and network segmentation before they run into any critical vulnerabilities. Addressing those fundamentals is key.
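To make the pod security and segmentation points concrete, here is a rough sketch that generates two manifests per namespace: one with Pod Security Admission labels set to Restricted, and one default-deny NetworkPolicy. The function names and the `payments` namespace are made up; the labels and API versions are the standard Kubernetes ones. It emits JSON, which `kubectl apply -f` accepts just like YAML.

```python
import json


def restricted_namespace(name):
    """Namespace manifest that enforces the Restricted Pod Security Standard."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {
                # enforce rejects violating pods; warn and audit only report
                "pod-security.kubernetes.io/enforce": "restricted",
                "pod-security.kubernetes.io/warn": "restricted",
                "pod-security.kubernetes.io/audit": "restricted",
            },
        },
    }


def default_deny_policy(namespace):
    """NetworkPolicy that selects every pod and allows no ingress or egress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},  # empty selector matches all pods in the namespace
            "policyTypes": ["Ingress", "Egress"],
        },
    }


if __name__ == "__main__":
    # "payments" is a placeholder namespace name.
    for manifest in (restricted_namespace("payments"), default_deny_policy("payments")):
        print(json.dumps(manifest, indent=2))
```

Two caveats: the NetworkPolicy only has an effect if your CNI actually enforces NetworkPolicy, and denying all egress also blocks DNS, so you will need explicit allow rules on top. For existing namespaces it is probably safer to start with only the `warn` and `audit` labels and add `enforce` once the warnings are clean.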
I’d be curious to know what specific gaps were identified in your audits. If confidentiality is an issue, I completely understand. I’ve been concerned about our clusters lately too; we found many pods running with overly broad IAM roles due to a misconfiguration, and we didn’t catch it until much later. I’d appreciate any insight on what scanning tools you’re using to uncover these issues as well.
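Not a replacement for a real scanner, but as a quick first pass for the overly broad IAM roles mentioned above, a check along these lines can help. It is a sketch that assumes you have already exported each role's policy document as JSON and loaded it into a dict; the function names and the example policy are made up.

```python
def _as_list(value):
    """IAM allows a single string or a list for Action/Resource; normalize to a list."""
    return value if isinstance(value, list) else [value]


def wildcard_statements(policy_document):
    """Return Allow statements that pair a wildcard action with Resource '*'."""
    flagged = []
    for stmt in _as_list(policy_document.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action", []))
        resources = _as_list(stmt.get("Resource", []))
        broad_action = any(a == "*" or a.endswith(":*") for a in actions)
        broad_resource = "*" in resources
        if broad_action and broad_resource:
            flagged.append(stmt)
    return flagged


# Example policy; the bucket name is a placeholder.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:*", "Resource": "*"},
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-bucket/*",
        },
    ],
}

if __name__ == "__main__":
    for stmt in wildcard_statements(example_policy):
        print("overly broad statement:", stmt)
```

This only catches the most obvious cases. It ignores `NotAction`, conditions, and partial wildcards such as `s3:Get*`.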
Adopting IRSA was a game changer for us. By switching from node-level IAM roles to per-service-account identities, we drastically reduced the risk of lateral movement: a compromised pod only gets the permissions of its own role rather than everything the node can do.
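Since the question asked for an IRSA example, here is a minimal sketch of the two pieces involved: the IAM trust policy that lets exactly one service account assume a role, and the ServiceAccount annotation that points at that role. The function names, account ID, OIDC provider ID, and workload names are placeholders, not values from anyone's setup.

```python
import json


def irsa_trust_policy(account_id, oidc_provider, namespace, service_account):
    """Trust policy that lets only one service account assume the role.

    oidc_provider is the cluster's OIDC issuer without the https:// prefix.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Pin the role to a single namespace and service account.
                        f"{oidc_provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                        f"{oidc_provider}:aud": "sts.amazonaws.com",
                    }
                },
            }
        ],
    }


def annotated_service_account(name, namespace, role_arn):
    """ServiceAccount manifest carrying the IRSA role annotation."""
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {"eks.amazonaws.com/role-arn": role_arn},
        },
    }


if __name__ == "__main__":
    # All identifiers below are placeholders.
    provider = "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE"
    print(json.dumps(irsa_trust_policy("111122223333", provider, "payments", "report-writer"), indent=2))
    print(json.dumps(annotated_service_account(
        "report-writer", "payments", "arn:aws:iam::111122223333:role/report-writer-s3"), indent=2))
```

The `sub` condition is the part that matters most. Using `StringEquals` with the full `system:serviceaccount:<namespace>:<name>` value, rather than a wildcard match, is what keeps other pods in the cluster from assuming the role.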
