Hey everyone, I'm working on a project for a large retailer with thousands of locations; our backend consists of around 10,000 servers spread worldwide. We're facing a dilemma with our logging requirements. On one hand, we need to centralize all logs for monitoring, but our logging budget (we use Elastic) is already substantial, running into millions of euros annually. On the other hand, when bugs arise we often hear that the logs lack detail, yet we can't increase the logging volume without blowing past that budget.
One idea I've been considering is a decentralized logging system where each server keeps its logs locally. The most important logs would then be sent to Elastic for central monitoring. We're looking for a way to manage these decentralized logs without having to connect to each server individually, especially since they run on Windows. Does anyone know of a system that can support decentralized log storage but also has centralized management capabilities?
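For what it's worth, the split you describe maps directly onto two log handlers with different levels. This is a minimal Python sketch of the idea, not your actual pipeline: the `CentralShipper` class is a hypothetical stand-in for whatever actually forwards records to Elastic (Filebeat, an HTTP bulk client, etc.), and the file paths and size limits are made up.

```python
import logging
from logging.handlers import RotatingFileHandler

class CentralShipper(logging.Handler):
    """Stand-in for the component that forwards records to central Elastic."""
    def __init__(self):
        super().__init__(level=logging.WARNING)  # only WARNING+ leaves the server
        self.shipped = []

    def emit(self, record):
        self.shipped.append(self.format(record))

# Local rotating file keeps everything, including DEBUG/INFO detail.
local = RotatingFileHandler("app.log", maxBytes=50_000_000, backupCount=10)
local.setLevel(logging.DEBUG)

central = CentralShipper()

logger = logging.getLogger("retail.backend")
logger.setLevel(logging.DEBUG)
logger.addHandler(local)
logger.addHandler(central)

logger.info("full detail, stays on this server")
logger.error("disk write failed")  # forwarded centrally
```

With this split, the detailed logs are there when someone debugs a specific server, while the central cluster only pays for the records worth monitoring.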
5 Answers
Have you thought about just shipping logs to a cloud or blob storage instead? Keeping logs distributed seems complicated, especially with security and maintenance for 10,000 servers. Storing them in the cloud could give you more querying options, plus it simplifies management.
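As a rough sketch of what the shipping side could look like: compress each rotated file and write it under a date/host-partitioned key, so query tools can prune by day and server. The key layout here is an assumption, and the actual upload call is left as a comment because it depends on your cloud SDK.

```python
import gzip
import socket
from datetime import datetime, timezone
from pathlib import Path

def package_for_blob(log_path: str) -> tuple[str, bytes]:
    """Compress a rotated log file and build a partitioned blob key.

    The logs/YYYY/MM/DD/host/name.gz layout is illustrative; pick whatever
    partitioning your query tool expects.
    """
    now = datetime.now(timezone.utc)
    host = socket.gethostname()
    name = Path(log_path).name
    key = f"logs/{now:%Y/%m/%d}/{host}/{name}.gz"
    body = gzip.compress(Path(log_path).read_bytes())
    return key, body

# key, body = package_for_blob("app.log.1")
# The upload itself depends on your SDK, e.g. boto3's
# s3.put_object(Bucket=..., Key=key, Body=body) or the Azure Blob equivalent.
```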
I managed to push 80 terabits of logs/events into Quickwit backed by S3; users just need to know how to search the system. Queries typically return results in under three seconds when configured well.
Honestly, when your custom solution starts failing, you'll likely realize that it would have been better to invest in a reliable managed service from the beginning. If you want to stay with Elastic, have you checked out their cross-cluster search? That might help!
Absolutely! I totally agree with you. We often discuss this, but I can only suggest solutions, and if the decision-makers aren't on board, we have to look for alternatives. Thanks for the link!
Why not run a separate Elastic instance on each server to catch info-level logs? You could send only the error logs and a sampling of info logs to the central Elastic instance to manage your costs better.
Don’t reinvent the wheel! Try the Grafana stack with VictoriaMetrics. Use OpenTelemetry to control what you log, monitor bandwidth and timing, and ramp the volume up once you know what's actually needed. You might also consider a managed platform, or a central hub where your clients are tenants with specific access controls.
That's a good point! I hadn’t considered blob storage before. I’ve also been looking into Grafana's Loki as a cost-effective logging solution. The locations require offline capability for reliability, which is why we need logs stored locally anyway, and the security processes for that are already in place.