I'm looking for a robust way to collect logs from all my Kubernetes pods, including logs from previous container runs, along with the state of API resources and events. This matters most when a run fails in an ephemeral cluster, such as those used in CI pipelines. I could write a wrapper around several kubectl commands in bash or Python, but I'm curious whether there's a dedicated tool that simplifies this and captures 'everything' I might need.
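For context, this is roughly the kind of bash wrapper I mean. It's a rough sketch, not a finished tool; the function name, output layout, and file names are my own, but the kubectl flags (`--previous`, `--all-containers`, `--sort-by`) are real:

```shell
#!/usr/bin/env bash
set -euo pipefail

# collect_dump: save events, resource state, and per-pod logs (including
# the previous container run, if any) into one artifact directory.
collect_dump() {
  local outdir="${1:-cluster-dump}"
  mkdir -p "$outdir"

  # Cluster-wide events and resource state
  kubectl get events -A --sort-by=.lastTimestamp > "$outdir/events.txt"
  kubectl get all -A -o yaml > "$outdir/resources.yaml"

  # Per-pod logs; --previous fails harmlessly when the pod never restarted
  kubectl get pods -A --no-headers \
      -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name |
  while read -r ns name; do
    kubectl logs -n "$ns" "$name" --all-containers \
        > "$outdir/$ns-$name.log" 2>/dev/null || true
    kubectl logs -n "$ns" "$name" --all-containers --previous \
        > "$outdir/$ns-$name.previous.log" 2>/dev/null || true
  done
}
```

I'd call `collect_dump ./artifacts` as a final, always-run CI step, but maintaining this by hand is exactly what I'd like to avoid.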
5 Answers
When deploying, you’ll need to actively track resources like pods, events, and custom resource statuses on your own to get the full picture. Alternatively, consider using a GitOps approach to separate these concerns; tools like ArgoCD can help with managing your deployments this way.
You might want to try kube-prometheus-stack with Loki and Promtail (now deprecated in favor of Alloy). It's pretty comprehensive for monitoring and logging, but keep in mind that if the cluster itself is ephemeral and torn down on failure, you may lose access to logs from those failed states unless you ship them somewhere external.
What happens if the deploy itself fails and I need logs from a CI run on an ephemeral environment that's already gone by the time I go looking for the failed job? I need something more reliable, built directly on kubectl.
If you want to capture cluster state, logs, and even logs from previous pods, consider plugins like kubectl-trace or kubectl-debug. That said, when things go wrong I usually end up piecing together kubectl commands anyway. I've also heard CubeAPM is gaining popularity for observability, but I haven't tested it for capturing cluster state; curious whether others have managed that.
I just want to save the state of the cluster before it's destroyed. Logging and observability should ideally happen after my code runs. If a foundational component like Longhorn fails during deployment, I need a way to keep that state as an artifact of the failed CI job.
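The closest built-in I've found is kubectl's own `cluster-info dump`, which writes per-namespace events, resource JSON, and current pod logs to disk (though I don't believe it captures `--previous` logs). The flags below are real; the wrapper function and directory name are just illustrative, meant to be invoked from an always-run CI step (e.g. `if: always()` in GitHub Actions):

```shell
# save_cluster_state: one-shot capture of cluster state before teardown,
# using kubectl's built-in dump. Without --output-directory everything
# goes to stdout instead.
save_cluster_state() {
  kubectl cluster-info dump --all-namespaces \
      --output-directory="${1:-./cluster-dump}"
}
```

Then the CI job uploads the directory (e.g. `save_cluster_state ./cluster-dump` followed by the platform's artifact-upload step).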
Stern is a decent option for logging, but I might be mistaken about its ability to pull --previous logs. If you know how to do that, please share!
I think it lacks that ability, but if I'm wrong, I'd love to learn how to retrieve previous logs!
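For what it's worth, plain kubectl definitely supports this, so I fall back to it when a tailing tool can't. A tiny sketch (the namespace and pod name are placeholders):

```shell
# previous_logs NAMESPACE POD: fetch logs from the prior container run,
# i.e. what a crashed container printed before its last restart.
previous_logs() {
  kubectl logs -n "$1" "$2" --all-containers --previous
}
```

Usage: `previous_logs my-ns my-pod`, typically right after spotting a `CrashLoopBackOff`.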
Another suggestion is to set up Alloy and send all logs and metrics to a remote Loki server. This way, even if the cluster shuts down, you'll still have historical logs to refer back to, including logs from past pods.

ArgoCD comes in too late in the pipeline for this. I've also run into issues like hitting Let's Encrypt's rate limit on new certificates, which caused Teleport to fail, and it's frustrating not being able to capture those errors from CI runs.