I've spent many years as an SRE, handling middle-of-the-night incidents while juggling a dozen browser tabs. One problem keeps recurring: things break because institutional understanding walks out the door when key engineers leave. The issue isn't a lack of tools; it's comprehension of the systems themselves. We have plenty of data sources—cloud APIs, infrastructure as code, pipelines, runbooks—but no clear synthesis of them. Answering basic questions about our own environment can take a week of digging through old threads and chasing people down.
I've built something aimed at this: a system comprehension layer that consolidates context from those existing sources into a living model of the infrastructure, making connections, ownership, and risk explicit. It's meant to improve understanding before you ship a change, not to replace your observability stack or dashboards. It's rough but functional, and free to use right now. I'm looking for honest feedback from people who face the same problem: does the premise match your experience, what's annoying, what's missing, and which parts would be worth open-sourcing for the community?
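To make the "model of the infrastructure" idea concrete, here is a minimal sketch of how such a comprehension layer might represent things internally: resources as graph nodes carrying ownership metadata (and which source the fact came from), dependencies as edges, so questions like "who owns this?" and "what breaks if it changes?" become graph queries. All names and data below are illustrative, not the actual tool's model.

```python
from collections import deque

# node -> metadata: who owns it, and which source the fact came from
nodes = {
    "payments-api": {"owner": "team-payments", "source": "terraform"},
    "payments-db":  {"owner": "team-payments", "source": "cloud-api"},
    "checkout-svc": {"owner": "team-checkout", "source": "runbook"},
}

# directed dependency edges: key depends on each value
deps = {
    "checkout-svc": ["payments-api"],
    "payments-api": ["payments-db"],
}

def blast_radius(target):
    """Everything that transitively depends on `target`."""
    # invert the dependency edges once
    rdeps = {}
    for src, tgts in deps.items():
        for t in tgts:
            rdeps.setdefault(t, []).append(src)
    # BFS upstream from the target
    seen, queue = set(), deque([target])
    while queue:
        for upstream in rdeps.get(queue.popleft(), []):
            if upstream not in seen:
                seen.add(upstream)
                queue.append(upstream)
    return seen

print(nodes["payments-db"]["owner"])   # team-payments
print(blast_radius("payments-db"))     # {'payments-api', 'checkout-svc'}
```

The point isn't the graph code itself but that the answers come with provenance, so you can tell a fact scraped from Terraform apart from one someone typed into a runbook two years ago.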
2 Answers
Your approach seems very focused on cloud infrastructure, almost like a Cloudcraft perspective. I see potential in pulling in data from popular APM tools to tie it closer to business logic. Many teams are building microservices, so a standard API could help correlate the pieces more effectively.
I think your concept is valuable, but I wonder about its necessity given existing tools like New Relic. They already provide application performance monitoring and trace the relationships between apps and databases. Most teams have their playbooks and architecture diagrams, so I'd like to know what specific problem your tool actually solves. If I have to adopt yet another tool that isn't clearly better, that's a hard sell.
Thanks for your thoughts! You're correct that tools like New Relic serve their purpose well; they map dependencies and show runtime behavior. However, my focus is on the gaps between those existing tools—architecture diagrams can become outdated, and the information about ownership can be scattered or stale. I'm trying to create a more cohesive understanding of the system that isn't reliant on just one person's knowledge. When changes are made, it shouldn't require digging through multiple sources to figure out how everything fits together.

Great insight! We started with cloud infrastructure to get the foundational integrations right first, but I agree about linking to APM tools. We're currently adding those integrations, built on OpenTelemetry, so that infrastructure context connects to business logic: users won't just see that an EC2 instance exists, they'll understand its role in the broader service architecture.
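As a rough sketch of what that correlation could look like (illustrative only, not our actual code): join a cloud inventory with the resource attributes that instrumented services report. The attribute keys below (`service.name`, `cloud.resource_id`) follow OpenTelemetry semantic conventions; the data and function names are made up.

```python
# Cloud inventory, e.g. from a cloud provider API, keyed by resource id
inventory = {
    "i-0abc123": {"type": "aws_ec2_instance", "region": "us-east-1"},
}

# Resource attributes reported by OTel-instrumented services
otel_resources = [
    {"service.name": "payments-api", "cloud.resource_id": "i-0abc123"},
]

def enrich(resource_id):
    """Attach service context to a bare cloud resource."""
    info = dict(inventory.get(resource_id, {}))
    for res in otel_resources:
        if res.get("cloud.resource_id") == resource_id:
            info["service"] = res["service.name"]
    return info

print(enrich("i-0abc123"))
# {'type': 'aws_ec2_instance', 'region': 'us-east-1', 'service': 'payments-api'}
```

The join key matters more than the code: if services emit a stable resource id in their telemetry, the infrastructure view and the APM view can be stitched together without anyone maintaining a mapping by hand.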