Hey everyone! We're developing a software system for a large retailer with thousands of stores, each running its own server, which means we're dealing with around 10,000 distributed backend instances worldwide. We're facing a dilemma with two conflicting logging requirements.

On one hand, we need all logs centralized for monitoring, which is currently set up with Elastic, but we also need to keep costs manageable: at current volumes we're looking at potentially millions per year for logging alone. On the other hand, we regularly get complaints that our logs lack detail when bugs occur, and increasing the amount of logging would blow the budget.

One idea I had was a decentralized setup where each server stores its logs locally and only forwards the critical ones to Elastic for central monitoring. We'd need a way to query each store remotely without logging in to every server individually (they all run Windows). Does anyone know of a system that can manage decentralized logging while still allowing central oversight?
5 Answers
In a year, when whatever custom solution you try falls apart, you may realize it would have been smarter to invest in a managed service from the start. If you're dead set on Elastic, have you looked at its cross-cluster search feature? It lets one coordinating cluster run searches against indices that live on remote clusters, which sounds close to what you're describing.
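In case it helps, here's a rough sketch of what that looks like against the Elasticsearch REST API; the host names, credentials, and index pattern are placeholders, and you'd register one remote alias per store:

```python
import requests

CENTRAL = "https://central-elastic.example.internal:9200"  # hypothetical coordinating cluster
AUTH = ("elastic", "changeme")                             # placeholder credentials

# One-time setup: register a store's local cluster as a remote.
# The seed must point at the remote cluster's transport port.
requests.put(
    f"{CENTRAL}/_cluster/settings",
    json={
        "persistent": {
            "cluster": {
                "remote": {
                    "store_0001": {"seeds": ["store-0001.example.internal:9300"]}
                }
            }
        }
    },
    auth=AUTH,
)

# Query that store's local logs from the central cluster using the <alias>:<index> syntax.
resp = requests.post(
    f"{CENTRAL}/store_0001:app-logs-*/_search",
    json={"query": {"match": {"level": "ERROR"}}, "size": 50},
    auth=AUTH,
)
print(len(resp.json()["hits"]["hits"]))
```

Whether that scales comfortably to ~10,000 remotes is something I'd test before committing to it.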
If you don’t need to keep logs local, consider shipping them to cloud or blob storage. Keeping them distributed can be tricky unless there's someone on-site who needs access. Once they're in blob storage, you’ll have different options for querying based on the provider.
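If you go that route, shipping a compressed batch from each store is only a few lines; a sketch with boto3, where the bucket name and key layout are made up:

```python
import gzip
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")  # credentials come from the usual AWS config/environment

def ship_log_batch(store_id: str, lines: list[str]) -> None:
    """Compress a batch of log lines and upload them under a store/date prefix,
    so they can be queried later (e.g. with Athena) or re-ingested centrally."""
    now = datetime.now(timezone.utc)
    key = f"logs/store={store_id}/dt={now:%Y-%m-%d}/{now:%H%M%S}.log.gz"
    body = gzip.compress("\n".join(lines).encode("utf-8"))
    s3.put_object(Bucket="retail-logs-archive", Key=key, Body=body)  # bucket is hypothetical

ship_log_batch("0001", ["2024-05-01T10:00:00Z INFO checkout started"])
```

The same pattern works against Azure Blob Storage or GCS with their respective SDKs.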
That's a good point! I hadn't considered blob storage, but I did come across Loki from Grafana, which looks promising as a low-cost storage option. That said, our client has strict reliability requirements and wants logs stored locally anyway, since the stores have to keep working offline.
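From what I can tell, getting logs into Loki is just an HTTP POST to its push API; something like this, where the endpoint and labels are assumptions on my part:

```python
import time

import requests

LOKI_PUSH_URL = "http://loki.example.internal:3100/loki/api/v1/push"  # hypothetical endpoint

def push_to_loki(store_id: str, level: str, line: str) -> None:
    # Loki expects nanosecond timestamps as strings, grouped into labelled streams.
    payload = {
        "streams": [
            {
                "stream": {"job": "pos-backend", "store": store_id, "level": level},
                "values": [[str(time.time_ns()), line]],
            }
        ]
    }
    requests.post(LOKI_PUSH_URL, json=payload, timeout=5)

push_to_loki("0001", "error", "payment service timed out after 3 retries")
```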
I used to push around 80 TB of logs into Quickwit backed by S3, and it worked well! As long as users knew what to search for, most queries came back in under 3 seconds. You could send all logs to S3 and then ingest them into Quickwit from there.
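The ingest/search flow is just HTTP against Quickwit's REST API; a rough sketch from memory (host, index id, and field names are made up, so double-check the exact endpoints against the docs):

```python
import json

import requests

QW = "http://quickwit.example.internal:7280"  # hypothetical Quickwit endpoint
INDEX = "store-logs"                          # hypothetical index id

# Ingest: the REST ingest endpoint takes newline-delimited JSON documents.
docs = [
    {"timestamp": "2024-05-01T10:00:00Z", "store": "0001",
     "level": "ERROR", "message": "payment gateway timeout"},
]
ndjson = "\n".join(json.dumps(d) for d in docs)
requests.post(f"{QW}/api/v1/{INDEX}/ingest", data=ndjson, timeout=10)

# Search: a simple query-string search over the same index.
resp = requests.get(
    f"{QW}/api/v1/{INDEX}/search",
    params={"query": "level:ERROR AND store:0001", "max_hits": 20},
    timeout=10,
)
print(resp.json())
```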
Avoid reinventing the wheel! Check out the Grafana stack and VictoriaMetrics. Start with OpenTelemetry and limit what you log at first to get a handle on bandwidth and timing, then scale up gradually. Agree on a common log structure across services, and buffer records in a proxy so you can enrich them with context beyond what the application code emits.
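To make the buffer-and-enrich part concrete, here's a hand-rolled sketch of the idea (the forwarding endpoint and metadata fields are invented); in practice you'd more likely let something like an OpenTelemetry Collector do this:

```python
import threading
import time

import requests

FORWARD_URL = "https://central-ingest.example.internal/logs"  # hypothetical central endpoint
STORE_METADATA = {"store_id": "0001", "region": "eu-west"}    # context the proxy adds to every record

_buffer: list[dict] = []
_lock = threading.Lock()

def enqueue(record: dict) -> None:
    """Enrich a raw log record with store-level context and buffer it locally."""
    record.update(STORE_METADATA)
    with _lock:
        _buffer.append(record)

def flush_loop(interval_s: int = 10, max_batch: int = 500) -> None:
    """Periodically forward buffered records to the central store in batches."""
    while True:
        time.sleep(interval_s)
        with _lock:
            batch = _buffer[:max_batch]
            del _buffer[:max_batch]
        if batch:
            requests.post(FORWARD_URL, json=batch, timeout=10)

# In a real deployment this would run as a long-lived service next to the app.
threading.Thread(target=flush_loop, daemon=True).start()
enqueue({"level": "ERROR", "message": "inventory sync failed", "ts": time.time()})
```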
Why not run another Elastic instance on each local machine to hold the full info-level logs? You could then send just the error logs and a sample of the info logs to the centralized Elastic instance, saving some space and resources.
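Something like this for the routing; the hosts, index names, and sample rate are placeholders, and in a real setup you'd probably do it in your log shipper rather than in application code:

```python
import random

import requests

LOCAL_ES = "http://localhost:9200"                            # store-local Elasticsearch
CENTRAL_ES = "https://central-elastic.example.internal:9200"  # central cluster (hypothetical host)
INFO_SAMPLE_RATE = 0.01                                       # forward roughly 1% of INFO records

def index_log(doc: dict) -> None:
    """Write every record locally; forward errors and a sample of INFO records centrally."""
    requests.post(f"{LOCAL_ES}/store-logs/_doc", json=doc, timeout=5)

    level = doc.get("level", "INFO").upper()
    if level in ("ERROR", "FATAL") or (level == "INFO" and random.random() < INFO_SAMPLE_RATE):
        requests.post(f"{CENTRAL_ES}/central-logs/_doc", json=doc, timeout=5)

index_log({"level": "ERROR", "message": "receipt printer offline", "store": "0001"})
```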
Definitely agree! I mention that to my team all the time, but it’s tough when the decision-makers aren’t on board. Thanks for the link!