I've been working with distributed tracing and Spark traces in Tempo for a while now, but I'm finding it hard to pin down which Spark stages are actually driving up our costs. It's frustrating because I've heard of teams reducing infrastructure expenses by over 100x just by identifying inefficiencies in their Spark jobs.

We want to link stage-level resource usage to real costs on AWS, but currently tracing doesn't give us meaningful insight. I can't even pinpoint which stages use the most CPU, memory, or disk I/O, let alone correlate that data with our AWS spending. I've tried the OTel Java agent with Tempo, but the spans don't line up with the Spark stages in any useful way. The Spark UI helps a bit, but it's not practical for ongoing cost analysis.

I'm starting to doubt whether distributed tracing is the right route for understanding our costs. Should I be looking at metrics and Mimir instead? Or is there a better way to organize Spark traces in Tempo for a proper cost breakdown?

I've done my homework, including reading the docs and asking various AI tools, but I'm still stuck. Any help or personal experience would be greatly appreciated!
1 Answer
Distributed tracing excels at showing what happened during execution, but it is often a poor fit for attributing cost.
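To make the kind of stage-level cost breakdown the question asks about concrete, here is a rough sketch that works from per-stage task time rather than from spans. Everything in it is an assumption for illustration: the `StageUsage` record and `stage_costs` helper are hypothetical, the run-time figures are made up, the price per core-hour is a placeholder rather than a real AWS rate, and it assumes each running task occupies exactly one core.

```python
from dataclasses import dataclass


@dataclass
class StageUsage:
    """Hypothetical per-stage record; populate it from whatever metrics source you use."""
    stage_id: int
    executor_run_time_ms: int  # task run time summed over all tasks in the stage


def stage_costs(stages, price_per_core_hour):
    """Estimate cost per stage as core-hours times an assumed price.

    Assumes one core per running task, so summed task time equals core time.
    """
    ms_per_hour = 3_600_000
    return {
        s.stage_id: s.executor_run_time_ms / ms_per_hour * price_per_core_hour
        for s in stages
    }


# Made-up numbers, purely for illustration.
stages = [StageUsage(0, 1_800_000), StageUsage(1, 7_200_000)]
costs = stage_costs(stages, price_per_core_hour=0.05)  # placeholder price

# List the most expensive stages first.
for stage_id, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"stage {stage_id}: ${cost:.4f}")
```

This only covers CPU time; memory, disk I/O, and idle executor time would need their own attribution rules, so treat the result as a relative ranking of stages rather than a bill.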
