Hey everyone! I'm looking for strategies to cut down on the noise in our observability data, especially when it comes to logs. We're overwhelmed by low-signal logs, particularly at the info and debug levels, which makes it really tough to identify actual issues and drives up our storage costs. We've tried some basic filtering, kept conservative so we don't miss important events, and have asked developers to log less, but the noise keeps creeping back. I'm curious whether any of you have found methods that worked for your team. Bonus points for any horror stories or 'aha' moments you can share! Thanks!
5 Answers
Hey there! This issue is pretty common, so here are a few tips that could help:
1. Try to log at the edge of your systems instead of in the core code. Logging at higher levels, like the route/controller level, gives you better context and significantly reduces log volume.
2. Consider moving to structured logging. Using key/value pairs rather than just string blobs makes it easier to filter out the noise and focus on important data.
3. You could also drop or sample logs based on logger names or content. OpenTelemetry Collector processors can drop high-volume logs (like health checks) by matching them with a regex.
4. Lastly, think carefully about which severity levels you're logging, and be deliberate about what gets kept versus discarded.
Hope this helps! I even wrote a blog on cutting observability costs and reducing data noise that might be useful. Check it out!
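To make tips 2 and 3 a bit more concrete, here is a minimal sketch using only Python's standard `logging` module. It assumes health-check requests show up in the message text; the `NOISE_PATTERN` regex, the `build_logger` helper, and the `extra={"fields": ...}` convention are all made up for illustration, so adapt them to whatever your services actually emit.

```python
import json
import logging
import re

# Hypothetical pattern for high-volume, low-value records (e.g. health checks).
NOISE_PATTERN = re.compile(r"GET /(healthz|readyz|ping)\b")


class DropNoiseFilter(logging.Filter):
    """Drop records whose rendered message matches NOISE_PATTERN."""

    def filter(self, record: logging.LogRecord) -> bool:
        return NOISE_PATTERN.search(record.getMessage()) is None


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so fields can be filtered on."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Passing key/value pairs via extra={"fields": {...}} is a convention
        # of this sketch, not something the logging module defines.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str, stream) -> logging.Logger:
    """Return a logger that writes filtered JSON lines to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(DropNoiseFilter())
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Usage would look something like `build_logger("api", sys.stderr).info("order created", extra={"fields": {"order_id": 42}})`. The same drop rule could live in your collector instead of the application; doing it in-process just saves the network hop.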
If you're using Grafana with Loki, it has features to recognize common patterns in logs. For instance, logs from a specific source can have their own patterns identified, making it easier to spot noise and filter it out. I often think of success messages, like 200 OK logs, as noise. Instead of logging them, increment a metric and drop the message. This approach handles a surprising amount of clutter.
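The "increment a metric and drop the message" idea can be sketched roughly like this in Python. This assumes your request logs carry the HTTP status as a `status` attribute (passed via `extra={"status": ...}`), which is an assumption of the example; `status_counts` is a plain `Counter` standing in for whatever metrics client you really use.

```python
import logging
from collections import Counter

# Stand-in for a real metrics client; a Counter keeps the sketch self-contained.
status_counts = Counter()


class SuccessToMetricFilter(logging.Filter):
    """Count 2xx records as a metric and drop them; let everything else through."""

    def filter(self, record: logging.LogRecord) -> bool:
        # `status` is a hypothetical attribute set by the caller via `extra`.
        status = getattr(record, "status", None)
        if isinstance(status, int) and 200 <= status < 300:
            status_counts[status] += 1
            return False  # drop the log line, keep only the count
        return True
```

Attach it with `logger.addFilter(SuccessToMetricFilter())`. You still know how many successes happened, you just stop paying to store one line per request.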
Definitely use log level enforcement, sampling, and structured logging to tackle the noise. Route low-value logs to cheaper storage or drop them entirely—quality should always take precedence over quantity.
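For the sampling part, one possible shape in Python is a filter that always keeps warnings and errors but only passes a fixed fraction of lower-severity records. This is a sketch under assumptions: the `SamplingFilter` name and the `keep_one_in` parameter are made up here, and it uses a deterministic "every Nth record" rule per logger rather than random sampling so the behavior is easy to reason about.

```python
import logging
from collections import defaultdict


class SamplingFilter(logging.Filter):
    """Keep every WARNING+ record and 1 in `keep_one_in` records below that."""

    def __init__(self, keep_one_in: int = 10):
        super().__init__()
        self.keep_one_in = keep_one_in
        # Per-logger count of low-severity records seen so far.
        # Not synchronized, so counts are approximate under heavy concurrency.
        self._seen = defaultdict(int)

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True
        self._seen[record.name] += 1
        # Keep the 1st, (N+1)th, (2N+1)th, ... low-severity record.
        return (self._seen[record.name] - 1) % self.keep_one_in == 0
```

Attaching this to the handler that ships to expensive storage, while a second unfiltered handler writes to cheap storage, is one way to get the "route low-value logs elsewhere" split.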
Right, but how do I figure out which logs to sample or drop? Are there any automated solutions? Our company is growing, and our log usage is really variable.
One simple approach is to turn off debug and info logs unless you're actively using them. Focus on warning and error logs, as they usually indicate real issues. You could also create tickets for developers who add too much logging, or implement PR checks to prevent chatty logs from being merged.
That sounds great, but we often need a lot of those info and debug logs for troubleshooting. It’s crucial to determine which logs are actually necessary.
Using stats to analyze your logs helps a lot too: count how often each kind of message occurs, and the biggest contributors tell you what to filter. Focusing on derived metrics rather than individual success logs could remove a significant amount of noise.
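As a rough illustration of the stats idea, and one possible answer to "which logs should I sample or drop?", the Python sketch below collapses variable parts of each message into placeholders and ranks the resulting patterns by share of total volume. The normalization rules (UUIDs and digit runs only) and the function names are assumptions of this example; real log lines usually need more rules.

```python
import re
from collections import Counter

_UUID = re.compile(r"\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b")
_NUM = re.compile(r"\d+")


def template_of(message: str) -> str:
    """Replace UUIDs and digit runs so similar messages share one pattern."""
    message = _UUID.sub("<uuid>", message)
    return _NUM.sub("<n>", message)


def top_patterns(messages, limit=5):
    """Return (pattern, count, share_of_total) for the most frequent patterns."""
    counts = Counter(template_of(m) for m in messages)
    total = sum(counts.values())
    return [(tpl, n, n / total) for tpl, n in counts.most_common(limit)]
```

Running this over a day's export and looking at the top handful of patterns usually shows where most of the volume comes from, which gives you a data-driven shortlist to sample, convert to metrics, or drop.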
Thanks for the tips and the blog link! But do you have any advice on how to make the logging process less tedious?