I'm developing an AI service where a single request often triggers multiple asynchronous tasks: several large language model calls, retries of failed model requests, batching, and fan-out/fan-in patterns.

I initially chose DTQ (Distributed Task Queue) because it's lightweight and simple to integrate into existing code. As my needs grew, though, its minimalism became limiting. Once I started implementing complex async flows, handling partial failures, ensuring idempotency, and needing visibility into stuck requests, I realized I was rebuilding a lot of structure that I had hoped DTQ would provide.

I'm looking for a solution that's stronger than a basic task queue but lighter than a full durable execution framework. I'm particularly interested in alternatives designed for AI workloads, specifically for managing LLM calls and retries.

Have others experienced these limitations with DTQ or similar lightweight solutions? What strategies did you use to overcome them? Did you switch to a more durable execution framework or develop your own abstractions?
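To make the "structure I was rebuilding" concrete, here is a minimal sketch of the kind of layer in question, written with plain `asyncio` rather than DTQ's API. Every name in it (`TransientError`, `fake_llm_call`, `call_with_retry`, `fan_out`, the in-memory `_results` store) is hypothetical and only for illustration. It fans out one call per prompt, retries transient failures with exponential backoff, skips work already recorded under an idempotency key, and fans in while keeping partial failures separate from successes.

```python
import asyncio
import hashlib
import random
from typing import Awaitable, Callable


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure (timeout, rate limit)."""


# In-memory result store keyed by idempotency key. This is an assumption for
# the sketch; a real service would need a store that survives restarts.
_results: dict[str, str] = {}


def idempotency_key(request_id: str, step: str, prompt: str) -> str:
    """Derive a stable key so the same step of the same request runs once."""
    digest = hashlib.sha256(prompt.encode()).hexdigest()[:16]
    return f"{request_id}:{step}:{digest}"


async def fake_llm_call(prompt: str) -> str:
    """Hypothetical model call that fails transiently about 30% of the time."""
    await asyncio.sleep(0)
    if random.random() < 0.3:
        raise TransientError("simulated transient failure")
    return prompt.upper()


async def call_with_retry(
    key: str,
    fn: Callable[[], Awaitable[str]],
    *,
    attempts: int = 3,
    base_delay: float = 0.01,
) -> str:
    """Run fn with exponential backoff, recording the result under key."""
    if key in _results:
        # Already completed earlier: return the recorded result, do not re-run.
        return _results[key]
    for attempt in range(1, attempts + 1):
        try:
            result = await fn()
        except TransientError:
            if attempt == attempts:
                raise  # Retries exhausted: surface the failure to the caller.
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
        else:
            _results[key] = result
            return result
    raise RuntimeError("unreachable: attempts must be >= 1")


async def fan_out(
    request_id: str,
    prompts: list[str],
    llm: Callable[[str], Awaitable[str]] = fake_llm_call,
) -> tuple[dict[str, str], dict[str, BaseException]]:
    """Fan out one call per prompt, then fan in successes and failures."""
    tasks = [
        call_with_retry(
            idempotency_key(request_id, f"step{i}", prompt),
            lambda prompt=prompt: llm(prompt),
        )
        for i, prompt in enumerate(prompts)
    ]
    # return_exceptions=True keeps one failed branch from discarding the rest.
    outcomes = await asyncio.gather(*tasks, return_exceptions=True)
    succeeded = {
        p: r for p, r in zip(prompts, outcomes) if not isinstance(r, BaseException)
    }
    failed = {p: r for p, r in zip(prompts, outcomes) if isinstance(r, BaseException)}
    return succeeded, failed
```

Calling `asyncio.run(fan_out("req-1", ["summarize", "classify"]))` returns a `(succeeded, failed)` pair, and re-submitting the same request ID only re-runs the branches that have no recorded result. What this sketch does not give me is durability across process restarts or any visibility into stuck requests, which is exactly the gap I'm asking about.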
2 Answers
Could you provide a few examples of your workflows? I have a framework called Darl that fits well with what you're trying to achieve. It's designed for general data science tasks but can work for AI if you handle the non-determinism of your LLM calls correctly. It offers all the benefits of DTQ but adds features like flow tracing, intermediate node replay, and built-in automatic checkpointing.
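To illustrate what I mean by handling LLM non-determinism with checkpointing and replay, here is a generic sketch. It is not Darl's API; `CheckpointedRun` and its `step` method are hypothetical names, and the JSON file is an assumed storage choice. The idea is that each step's output is recorded the first time it runs, so a replay after a crash reuses the recorded LLM output instead of calling the model again and getting a different answer.

```python
import json
from pathlib import Path
from typing import Any, Callable


class CheckpointedRun:
    """Hypothetical helper: records each step's output in a JSON file."""

    def __init__(self, path: Path) -> None:
        self.path = path
        # Load checkpoints left by an earlier, possibly crashed, run.
        self.steps: dict[str, Any] = (
            json.loads(path.read_text()) if path.exists() else {}
        )
        # In-memory trace of what happened during this run.
        self.trace: list[tuple[str, str]] = []

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        """Run fn once per step name; later runs replay the stored result."""
        if name in self.steps:
            self.trace.append((name, "replayed"))
            return self.steps[name]
        result = fn()  # If this raises, nothing is recorded for the step.
        self.steps[name] = result
        # Persist after every step so a crash loses at most the current one.
        self.path.write_text(json.dumps(self.steps))
        self.trace.append((name, "executed"))
        return result
```

A workflow would wrap each model call as `run.step("draft", lambda: call_model(prompt))`, where `call_model` is whatever client function you already have. Step results must be JSON-serializable in this sketch, and step names must be stable across runs for replay to line up.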
Does Darl run without needing a database or message broker, or is it limited to single instances?
You might want to check out DBOS. It's a lighter alternative to Temporal and might address the issues you've encountered. Its queue system is also well documented.

The README is quite detailed, but maybe a summary of the main features would help newcomers like me get the gist quicker.