Hey everyone! I'm working on a project for a big retailer with thousands of stores, each running its own server, so we're dealing with about 10,000 distributed instances globally. Our challenge is the logging system, and we have two main concerns. First, we want to keep our logs centralized for monitoring, but we have to watch costs, since we're looking at potentially millions of euros per year for logging. Second, when bugs arise, we often hear complaints that the logs lack detail, but adding more logging could push us over budget.
One idea I'm considering is decentralized log stores: each server keeps its logs locally while also sending the most important ones to Elastic for central monitoring. I'm looking for a solution that lets us connect to each store and run queries without having to access every server individually via remote desktop (these are Windows machines, by the way). If anyone has insights or knows of a system that meets these requirements, I'd really appreciate any input!
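To make the two-tier idea concrete, here's a rough sketch using Python's stdlib logging: everything stays in a local file, and only ERROR-and-above records reach a second handler. The CentralShipper class is a hypothetical stand-in for whatever would actually ship records to Elastic; a real one would batch and send over the network.

```python
import logging

class CentralShipper(logging.Handler):
    """Placeholder for a handler that forwards records to central Elastic.
    A real implementation would batch records and POST them to the cluster."""
    def __init__(self, level=logging.ERROR):
        super().__init__(level)
        self.shipped = []  # stands in for the network send

    def emit(self, record):
        self.shipped.append(self.format(record))

log = logging.getLogger("store-0042")  # example store id
log.setLevel(logging.INFO)

# Tier 1: every record stays on the store server.
local = logging.FileHandler("store.log")
log.addHandler(local)

# Tier 2: only ERROR and above leaves the store.
shipper = CentralShipper(level=logging.ERROR)
log.addHandler(shipper)

log.info("price sync finished")        # local only
log.error("POS terminal unreachable")  # local + central
```

The same logger feeds both tiers, so the verbosity debate only affects local disk, not the central bill.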
5 Answers
In a year, when your custom solution is struggling, you'll wish you had invested in a managed service from the get-go. If you're sticking with Elastic, have you looked into their Cross-Cluster Search feature? It might help mitigate some of your issues.
If you don’t need to keep the logs local, consider shipping them to cloud or blob storage. Keeping them distributed seems excessive unless someone is on-site needing access. Once in blob storage, you’ll have various options for querying depending on your provider.
Good point! I hadn’t thought about blob storage. I found Loki from Grafana as a potential low-cost storage option, but our client insists on having logs available on the store servers for offline functionality, and network security is already managed.
I was dealing with 80 TB of events/logs using Quickwit backed by S3. Users were able to search efficiently, returning most queries in under 3 seconds. Just send logs to S3 and let Quickwit handle it from there.
Don’t reinvent the wheel; explore the Grafana stack or VictoriaMetrics. Start with OpenTelemetry and be smart about what you log to manage bandwidth. You can set up a proxy to buffer your logs for better control and enrichment, keeping everything organized for your clients.
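The buffering-proxy idea could look something like the minimal sketch below: lines are enriched with a store id, buffered, and sent upstream in batches instead of one connection per line. The batch size, age limit, and enrichment field are all assumptions, not anything prescribed by the stacks mentioned above.

```python
import time

class BufferingForwarder:
    """Buffers log lines and forwards them in batches, so each store
    makes one upstream send per batch instead of per line."""
    def __init__(self, send, batch_size=100, max_age_s=5.0):
        self.send = send              # callable taking a list of lines
        self.batch_size = batch_size
        self.max_age_s = max_age_s
        self.buf = []
        self.first_at = None

    def add(self, line, store_id="store-0042"):
        # Enrich each line with the store id before buffering.
        self.buf.append(f"{store_id} {line}")
        self.first_at = self.first_at or time.monotonic()
        if (len(self.buf) >= self.batch_size
                or time.monotonic() - self.first_at >= self.max_age_s):
            self.flush()

    def flush(self):
        if self.buf:
            self.send(self.buf)
            self.buf, self.first_at = [], None

batches = []
fwd = BufferingForwarder(batches.append, batch_size=3)
for i in range(7):
    fwd.add(f"event {i}")
fwd.flush()  # drain the partial tail batch
```

In practice you'd point `send` at an OTLP exporter or similar rather than a list, but the buffering and enrichment logic is the same.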
Consider running another Elastic instance on each local machine to gather all the info-level logs. You can then send only error logs and a sampled selection of info logs to your main Elastic instance.
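Sampling the info stream before it leaves the store could be done with a `logging.Filter` attached to the shipping handler; this sketch passes every WARNING-and-above record but only one in N INFO records. The 1-in-100 rate and the deterministic counter (instead of random sampling) are just example choices.

```python
import logging

class SampledForwardFilter(logging.Filter):
    """Pass every WARNING+ record, but only 1 in N INFO records.
    Attach to the handler that ships to the central cluster."""
    def __init__(self, n=100):
        super().__init__()
        self.n = n
        self.seen = 0

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True
        self.seen += 1
        return self.seen % self.n == 1  # deterministic 1-in-N sample

shipped = []

class ListHandler(logging.Handler):
    """Stand-in for the handler that sends to the main Elastic instance."""
    def emit(self, record):
        shipped.append(record.getMessage())

log = logging.getLogger("sampled-store")
log.setLevel(logging.INFO)
log.propagate = False
central = ListHandler()
central.addFilter(SampledForwardFilter(n=100))
log.addHandler(central)

for i in range(250):
    log.info("heartbeat %d", i)   # only every 100th is forwarded
log.error("crash")                # always forwarded
```

The local Elastic instance would get its own unfiltered handler, so full detail is still available on the store server when someone needs to dig in.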

Totally agree! We often suggest that, but the decision-makers aren’t always on board. Appreciate the link!