I'm a junior software developer trying to build a semi-real-time monitoring system for my application; delays of under 15 minutes are acceptable. My application generates events with the following states: `queued`, `error`, `processed`, and `to_be_requeued`. I want to be alerted when a state changes to `error`, and I also want to catch the case where an order enters the `queued` state but never moves to `processed`. I would like to flag that as an error once the time elapsed since the `queued` event's `timestamp` exceeds a certain threshold.
Currently, I've set up a proof of concept that dumps raw events into a Timescale database, and a web API then polls and processes them at defined intervals. Unfortunately, the performance isn't what I expected, and I'm looking for ways to improve it. I came across the ELK stack and wonder if it could fit my needs. My understanding is that Elasticsearch is primarily a key-value store and Kibana is a visualization tool for that data. Can the ELK stack handle my situation? If not, what other architectures or approaches should I consider? I'm also looking for informative resources and experiences from anyone who has tackled a similar task. Thanks!
3 Answers
The ELK stack is neat, but it might be more than you actually need. Logstash offers quite a bit of functionality, but you could also look at the LGTM stack (Loki, Grafana, Tempo, Mimir). It lets you define alerts on PromQL queries, for example on error rates. Also consider using Vector to convert events into metrics for alerting.
You should consider turning your event types into metrics to visualize them effectively. Think about how you're collecting these metrics, where you're storing them, and how you're visualizing them. Metrics can give you insights into what’s happening with your events.
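To make "events into metrics" a bit more concrete, here is a minimal Python sketch of the idea, not tied to any particular stack. It assumes each event is a dict with hypothetical `id`, `state`, and `timestamp` fields (the field names and the 60-second bucket size are my assumptions, not something from your setup) and simply counts events per state per time bucket. Those counts are the kind of series you would then store and graph or alert on.

```python
from collections import Counter
from datetime import datetime, timezone


def bucket_counts(events, bucket_seconds=60):
    """Count events per (bucket start as epoch seconds, state)."""
    counts = Counter()
    for event in events:
        ts = event["timestamp"].timestamp()
        # Round the timestamp down to the start of its bucket.
        bucket_start = int(ts // bucket_seconds) * bucket_seconds
        counts[(bucket_start, event["state"])] += 1
    return counts


# Hypothetical sample data, only for illustration.
sample_events = [
    {"id": "a", "state": "queued",
     "timestamp": datetime(2024, 1, 1, 12, 0, 5, tzinfo=timezone.utc)},
    {"id": "b", "state": "queued",
     "timestamp": datetime(2024, 1, 1, 12, 0, 40, tzinfo=timezone.utc)},
    {"id": "a", "state": "processed",
     "timestamp": datetime(2024, 1, 1, 12, 1, 10, tzinfo=timezone.utc)},
    {"id": "b", "state": "error",
     "timestamp": datetime(2024, 1, 1, 12, 1, 30, tzinfo=timezone.utc)},
]

if __name__ == "__main__":
    for (bucket_start, state), n in sorted(bucket_counts(sample_events).items()):
        when = datetime.fromtimestamp(bucket_start, tz=timezone.utc)
        print(f"{when.isoformat()} {state}: {n}")
```

An alert on "error count per minute above zero" then becomes a simple threshold on one of these series, whichever tool ends up storing them.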
Right now, we push event logs into a Kafka topic using Fluent Bit. It's crucial for us to know immediately if an error occurs, so a web UI or notifications via Slack would be super useful. Plus, being able to visualize the events that led to an error would save a lot of time compared to manually grepping through files. I'm curious if Elasticsearch can handle past event states and detect issues like a hanging queue.
The PIG stack could be another viable option for your monitoring needs. Grafana has excellent alert features, and you can easily set up alerts to send notifications to Slack.
I want to give ELK a shot since it's my first time working with stacks. A proof of concept could help me understand its search capabilities compared to manually going through log files. However, I'm still worried about how to flag errors or detect missing event states, which seems important no matter what stack I use.
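For what it's worth, the "stuck in `queued`" check is independent of the stack and can be prototyped in a few lines. Below is a rough Python sketch under some assumptions: each event is a dict with hypothetical `id`, `state`, and `timestamp` fields, the latest event per `id` represents its current state, and the 15-minute threshold is only a placeholder. How `to_be_requeued` should be treated is not specified in the thread, so the sketch leaves it alone.

```python
from datetime import datetime, timedelta, timezone


def find_problems(events, now, max_queue_age=timedelta(minutes=15)):
    """Return (errored_ids, stuck_ids) based on each id's latest event.

    max_queue_age is an assumed placeholder threshold.
    """
    # Keep only the most recent event for every id.
    latest = {}
    for event in sorted(events, key=lambda e: e["timestamp"]):
        latest[event["id"]] = event

    errored, stuck = [], []
    for item_id, event in latest.items():
        if event["state"] == "error":
            errored.append(item_id)
        elif event["state"] == "queued" and now - event["timestamp"] > max_queue_age:
            # Still queued and older than the threshold: flag it.
            stuck.append(item_id)
    return sorted(errored), sorted(stuck)


if __name__ == "__main__":
    utc = timezone.utc
    # Hypothetical sample data, only for illustration.
    sample_events = [
        {"id": "a", "state": "queued", "timestamp": datetime(2024, 1, 1, 12, 0, tzinfo=utc)},
        {"id": "a", "state": "processed", "timestamp": datetime(2024, 1, 1, 12, 1, tzinfo=utc)},
        {"id": "b", "state": "queued", "timestamp": datetime(2024, 1, 1, 12, 10, tzinfo=utc)},
        {"id": "c", "state": "queued", "timestamp": datetime(2024, 1, 1, 12, 55, tzinfo=utc)},
    ]
    now = datetime(2024, 1, 1, 13, 0, tzinfo=utc)
    errored, stuck = find_problems(sample_events, now)
    print("errored:", errored)
    print("stuck in queue:", stuck)
```

Running a check like this on a schedule (or expressing the same logic as a query in whichever store you pick) covers both conditions from the original question: explicit `error` states and items that never leave `queued`.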