I'm trying to make a shift into Site Reliability Engineering (SRE) and would love some guidance. Currently, I'm working in a TechOps role where my main tasks include debugging production issues, monitoring systems, and handling incidents at an L1/L2 level. I've got some experience with manual debugging using browser DevTools, basic API investigation, and tools like New Relic and Grafana for monitoring. I know Linux fundamentals, can script in Bash and Python, and have dabbled in cloud concepts, mainly on AWS. I'm also learning Docker and planning to dive into Kubernetes soon.
Here's what I need help with: What foundational skills should I focus on to move from TechOps to SRE? Do I prioritize getting a cloud certification or should I get hands-on with tools like Kubernetes and Terraform first? Are there specific projects I could work on to enhance my profile? How do I know when I'm ready to apply for SRE positions? Any advice from those who have made a similar jump would be greatly appreciated!
If you're aiming for SRE, focus more on infrastructure and automation than on specific cloud tools. Understand the fundamentals of servers, networking, and storage. Infrastructure as Code (IaC) tools like Terraform are important, but understanding why you're automating something matters more than knowing the tool itself. The goal is to build a working understanding of the different technologies you'll be dealing with. Learning how failover systems work and how to troubleshoot operating-system behavior will pay off in the long run.
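As one hypothetical starter project tying together automation and failover, here is a minimal Python sketch (Python because you already script in it) that picks the first healthy endpoint from a priority-ordered list. It assumes each service exposes an HTTP health endpoint that returns a 2xx status when healthy; the function names and the example URLs are illustrative only, not part of any particular tool.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional, Sequence


def http_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with an HTTP 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Covers connection refused, DNS failure, timeouts,
        # 4xx/5xx responses (HTTPError), and malformed URLs.
        return False


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool] = http_healthy,
) -> Optional[str]:
    """Return the first healthy endpoint in priority order, or None if all are down."""
    for url in endpoints:
        if is_healthy(url):
            return url
    return None


if __name__ == "__main__":
    # Hypothetical primary/standby health URLs; replace with your own services.
    candidates = [
        "https://primary.example.internal/healthz",
        "https://standby.example.internal/healthz",
    ]
    print("routing to:", pick_endpoint(candidates))
```

Passing the health check in as `is_healthy` keeps the failover decision testable without a network. A real setup would also need to deal with flapping, alerting, and deciding who or what acts on the result, which is exactly the "why" behind the automation.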

That's helpful, thanks! Can you suggest what specific skills I should target in my learning plan?