What’s the Best Decentralized Logging System for a Large Retailer?

Asked By TechSavvyNinja42

Hey everyone! We're building the backend and app for a major retailer with thousands of stores, which means roughly 10,000 servers distributed globally. We're stuck on our logging strategy because two requirements clash. First, we need to centralize all logs for monitoring without breaking the bank; our budget for logging with Elastic runs into the millions. Second, we get constant feedback that our logs lack the detail needed for troubleshooting, and adding more logging risks driving costs up significantly.

One idea we're considering is a system of decentralized log stores: each server keeps its full logs locally while sending only the essential ones to Elastic for central monitoring. However, we need a streamlined way to manage this and to query logs across all stores without remoting into each Windows server individually. I'm looking for a decentralized logging solution that still offers centralized management. Any recommendations?
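
To make this more concrete, here's roughly what I have in mind per server. It's only an illustrative Python sketch; the shipper endpoint, paths, and thresholds are placeholders, and in practice an agent such as Filebeat would likely do the forwarding rather than the app itself.

```python
import logging
from logging.handlers import RotatingFileHandler, SocketHandler

logger = logging.getLogger("store-app")
logger.setLevel(logging.DEBUG)

# Everything stays on the local server, rotated so disk usage is bounded.
local = RotatingFileHandler("C:/logs/app.log", maxBytes=50_000_000, backupCount=10)
local.setLevel(logging.DEBUG)
logger.addHandler(local)

# Only the essentials are forwarded to whatever ships into Elastic centrally.
# "log-shipper.internal" is a hypothetical endpoint, not a real service of ours.
central = SocketHandler("log-shipper.internal", 9020)
central.setLevel(logging.WARNING)
logger.addHandler(central)

logger.debug("detailed trace, kept locally only")
logger.error("also forwarded for central monitoring")
```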

5 Answers

Answered By CloudExplorer

Consider checking out the Grafana stack and VictoriaMetrics. Lean on OpenTelemetry so you can gradually work out what you actually need to log while keeping bandwidth under control. You could also put a proxy in front of the pipeline to buffer and enrich the data without any code changes.
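
To illustrate the buffering idea without tying it to any one agent, here's a minimal sketch using only Python's standard logging. The agent host and path are placeholders; in a real setup the OpenTelemetry Collector (or another local proxy) would do the batching and enrichment, and this handler only POSTs raw record fields, not OTLP.

```python
import logging
from logging.handlers import HTTPHandler, MemoryHandler

# Placeholder local agent; an OTel Collector or similar proxy would sit here.
agent = HTTPHandler("localhost:4318", "/logs", method="POST")

# Hold up to 500 records in memory and flush them as a batch, or immediately
# whenever an ERROR arrives, so upstream bandwidth stays bounded.
buffered = MemoryHandler(capacity=500, flushLevel=logging.ERROR, target=agent)

logger = logging.getLogger("edge")
logger.setLevel(logging.INFO)
logger.addHandler(buffered)

for i in range(1000):
    logger.info("routine event %d", i)  # buffered, shipped in batches
logger.error("failure")                 # triggers an immediate flush
```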

Answered By LogWizard99

Instead of keeping logs local, why not consider shipping them to cloud or blob storage? Keeping distributed logs can complicate networking and security, especially with so many servers. Once they're in blob storage, you’d have different querying options depending on your cloud provider.
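
As a sketch of the shipping side, assuming AWS and boto3 (the bucket name, store id, and paths below are made up), each server could compress its rotated log files and push them to blob storage on a schedule:

```python
import gzip
import shutil
from datetime import datetime, timezone
from pathlib import Path

import boto3

BUCKET = "retailer-logs"    # hypothetical bucket name
LOG_DIR = Path("C:/logs")   # wherever the rotated files land

s3 = boto3.client("s3")

def ship_rotated_logs() -> None:
    """Compress each rotated log file and upload it under a store/date prefix."""
    today = datetime.now(timezone.utc).strftime("%Y/%m/%d")
    for log_file in LOG_DIR.glob("app.log.*"):
        gz_path = Path(str(log_file) + ".gz")
        with open(log_file, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        key = f"store-0421/{today}/{gz_path.name}"  # store id is a placeholder
        s3.upload_file(str(gz_path), BUCKET, key)
        gz_path.unlink()

if __name__ == "__main__":
    ship_rotated_logs()
```

On Azure or GCP the same pattern applies with their object storage SDKs.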

RetailGuru88 -

That's a solid point! I hadn't thought about blob storage. I've been looking into Grafana's Loki as a possible low-cost alternative. The client still wants local logs for reliability, but we already manage the network security between stores, so that side isn't a blocker.

Answered By QuickwitFan

I've handled around 80 TB of events and logs using Quickwit backed by S3. Logs are sent to S3 first and then indexed by Quickwit; as long as your team knows how to write the searches, query responses are quick, usually under 3 seconds.
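
For reference, the query side from a workstation looks roughly like this. It's only a sketch with the requests library: the host, index id, and field names are made up, and the endpoint and parameter names should be double-checked against the docs for the Quickwit version you deploy.

```python
import requests

QUICKWIT = "http://quickwit.internal:7280"  # hypothetical host (7280 is the default port)
INDEX = "store-logs"                        # hypothetical index id

# Search everything ingested from S3 for recent errors at one store.
resp = requests.get(
    f"{QUICKWIT}/api/v1/{INDEX}/search",
    params={"query": "level:ERROR AND store_id:0421", "max_hits": 20},
    timeout=10,
)
resp.raise_for_status()
for hit in resp.json().get("hits", []):
    print(hit)
```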

Answered By FutureGadgeteer

In a year, you might find your custom solution is harder to maintain than going with a managed service. If you stick with Elastic, have you checked out their cross-cluster search documentation? That could help you manage the load.
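
For what it's worth, with cross-cluster search the query side barely changes: the remote clusters are registered on the central cluster and then addressed with the <cluster_alias>:<index_pattern> syntax. Here's a sketch with the official elasticsearch Python client; the cluster aliases, index pattern, and credentials are made up.

```python
from elasticsearch import Elasticsearch

# Central monitoring cluster; the remote store clusters are registered on it.
es = Elasticsearch("https://central-monitoring.example.com:9200", api_key="...")

resp = es.search(
    index="store_emea:logs-*,store_apac:logs-*",  # hypothetical remote cluster aliases
    query={
        "bool": {
            "filter": [
                {"term": {"log.level": "error"}},
                {"range": {"@timestamp": {"gte": "now-1h"}}},
            ]
        }
    },
    size=50,
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```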

TechSavvyNinja42 -

Totally agree! I often suggest the easy route, but when decisions go the other way, we have to find the best next option. Thanks for the link!

Answered By ElasticPro2023

Why not run another instance of Elastic on each machine for local storage? You could send only error logs and selectively sampled info logs to the central Elastic instance. This way, you'd get a good balance of detail while managing costs.
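
The "selectively sampled info logs" part can be as simple as a filter on whatever handler forwards to the central instance. A rough sketch follows; the 1% rate and the stand-in handler are placeholders, not a recommendation.

```python
import logging
import random

class SampleInfoFilter(logging.Filter):
    """Always pass WARNING and above; pass lower levels only for a sampled fraction."""

    def __init__(self, sample_rate: float = 0.01) -> None:
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True
        return random.random() < self.sample_rate

# Attach the filter only to the handler that forwards to central Elastic;
# the local instance still receives every record.
central_handler = logging.StreamHandler()          # stand-in for the real shipper
central_handler.addFilter(SampleInfoFilter(0.01))  # ~1% of info logs go central

logger = logging.getLogger("store-app")
logger.setLevel(logging.DEBUG)
logger.addHandler(central_handler)
```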
