Hey everyone, I'm a recent graduate with around 1.5 years of experience, including a strong internship at a cloud provider and several personal projects under my belt. I've previously set up Prometheus and Grafana for monitoring over 50 nodes, which helped reduce incident response time by roughly 20%. I'm currently working on a project to create a centralized monitoring and observability system for a pharmaceutical company with about 1,500 to 2,000 employees, and the scope involves various components like metrics collection, centralized logging, network device monitoring, database monitoring, application monitoring, and security detection. While I'm confident in handling the metrics and basic logs, I'm feeling overwhelmed with the broader system integration and advanced alerting aspects. I'm looking for honest opinions on whether this is a task suitable for a strong junior developer, or if it's typically reserved for more experienced professionals. I'd also appreciate any advice on common pitfalls to avoid and recommendations for a starting setup or tech stack. Thanks in advance for your insights!
5 Answers
Getting a basic setup running isn't impossible. You could get started with the Helm charts for the Prometheus stack. A junior can reasonably handle the initial deployment, but configuring everything to produce meaningful metrics and alerts is where the challenge lies. That part is complex, and getting it right takes some experience.
You could opt for a simpler solution depending on your needs. If cost isn't an issue, platforms like Splunk or Datadog could streamline things for you. They're much easier to implement than a full DIY observability stack; just weigh their cost against that of running your own system.
It's great that you're aiming high, but I recommend narrowing your focus at the start. Before going for the entire observability stack, identify which systems have the most immediate impact on the business. Set up monitoring for those crucial systems first, and build from there. Creating alerts for key metrics that directly affect uptime can be more beneficial than trying to manage everything at once. Just remember that full RCA (root cause analysis) can be tricky without proper systems in place, so tackle it step by step.
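To make the "alert on uptime first" idea concrete, here's a rough sketch in Python that builds one Prometheus-style alerting rule per critical system. The job names, the alert name, and the 5-minute window are all made up for illustration; the only thing taken from Prometheus itself is the `up` metric (0 when a scrape target is unreachable) and the usual rule-file layout of groups and rules. Rule files are normally written in YAML, so the JSON output here is just to show the structure, not something to deploy as-is.

```python
import json


def uptime_alert_rule(job: str, for_duration: str = "5m") -> dict:
    """Build a rule that fires when a scrape target for `job` stays down."""
    return {
        "alert": "TargetDown",  # hypothetical alert name
        # `up` is 0 when Prometheus cannot scrape the target.
        "expr": f'up{{job="{job}"}} == 0',
        # Wait before firing so a single failed scrape does not page anyone.
        "for": for_duration,
        "labels": {"severity": "critical"},
        "annotations": {
            # Prometheus fills in the template when the alert fires.
            "summary": "{{ $labels.instance }} is not responding to scrapes",
        },
    }


# Hypothetical job names: replace with whatever matters most to the business.
critical_jobs = ["erp-database", "lab-systems"]

rule_group = {
    "groups": [
        {
            "name": "business-critical-uptime",
            "rules": [uptime_alert_rule(job) for job in critical_jobs],
        }
    ]
}

print(json.dumps(rule_group, indent=2))
```

Starting with a short list like this keeps the first round of alerts small enough to tune before you add more systems.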
Honestly, I don't think this is a junior-level task. If you're looking at integrating a full observability framework with SIEM and comprehensive monitoring, that's a major team effort. If you're just planning on gathering logs and metrics from one cluster, that’s doable, but if you're aiming for everything, you'd need a lot of support. Don’t underestimate the enormity of this project, especially at a larger scale.
From what you've described, it seems like you're already handling quite a bit, which indicates some mid-level capability. Juniors usually execute tasks with guidance, while mid-levels can manage without it but might not grasp the bigger picture. Given the complexity of building a whole observability system, it definitely leans more towards mid-level and above, especially if you're looking for a production-grade setup.

That's the trade-off! Splunk and Datadog are great for quick setups but can get pricey. If you're going the self-hosted route to save costs, just be prepared for the learning curve and the maintenance workload.