Queue Design for Long-Running Agents
Once an agent task runs longer than a normal request cycle, the architecture changes.
You are no longer building “a chat feature”. You are building a durable workflow with a model in the middle. That distinction matters because the queue is only the delivery mechanism. The workflow is the state machine that survives retries, worker restarts, duplicate messages, and user cancellation.
Amazon SQS standard queues are a good reminder of the default failure model: they provide at-least-once delivery, messages can be delivered more than once, and they may arrive out of order.[1] Visibility timeout is only a lease on a message while a consumer is working it; if the work outlives that lease, the message can reappear.[2] Dead-letter queues exist because some jobs will fail repeatedly and need to be isolated for inspection instead of retried forever.[3]
If your agent workflow does not assume those constraints, it will eventually create duplicates, stall on poison jobs, or lose track of what already happened.
TL;DR
- Model the job as a persisted state machine.
- Assume at-least-once delivery and duplicate execution.
- Put idempotency keys on every side-effecting step.
- Persist checkpoints after each irreversible action.
- Make cancellation stop future work, not just hide it in the UI.
- Send poison jobs to a DLQ or terminal failure state with enough context to debug them.
The queue is not the workflow
The queue moves messages. It does not know whether the agent already sent the email, wrote the ticket, or charged the card.
That is why the job record has to carry the real state:
- current step
- last durable checkpoint
- attempt count
- next retry time
- external object ids
- terminal status
Temporal’s platform docs make the same point from a different angle: Temporal is built around crash-proof execution, with workflows resuming where they left off after crashes, network failures, or infrastructure outages.[4] That is the right mental model for long-running agents too. The durability lives in the workflow state, not in the transient worker process.
A state machine beats a promise chain
The simplest version that holds up in production is an explicit state machine.
| State | Meaning | What changes it |
|---|---|---|
| queued | Job exists, not yet started | Worker claims the job |
| running | Step execution in progress | Step succeeds, fails, or pauses |
| waiting_for_approval | Human gate is pending | Approval or timeout |
| retrying | A step failed and is scheduled again | Retry policy or backoff |
| cancelled | User stopped the job | Cancellation request or revocation |
| failed | Final failure with a reason | Exhausted retries or fatal error |
| dead_lettered | Message was isolated for inspection | DLQ redrive or poison-job policy |
| completed | Work finished successfully | Final output persisted |
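One way to make the table executable is an explicit transition map, so an illegal move (say, a worker marking a cancelled job as completed) fails loudly. The allowed edges below are a plausible reading of the table, not a fixed standard; adjust them to your own policy.

```python
# Hypothetical transition table for the states above. Enforcing it turns
# "which state is the job in?" into a checked invariant.
ALLOWED = {
    "queued": {"running", "cancelled"},
    "running": {"waiting_for_approval", "retrying", "cancelled",
                "failed", "completed"},
    "waiting_for_approval": {"running", "cancelled", "failed"},
    "retrying": {"running", "cancelled", "failed", "dead_lettered"},
    "cancelled": set(),          # terminal
    "failed": {"dead_lettered"},
    "dead_lettered": set(),      # terminal
    "completed": set(),          # terminal
}

def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```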
If you cannot tell which state the job is in, you cannot tell what should happen next.
Idempotency is not optional
Duplicate delivery is not a corner case. It is the default you design around.
Stripe’s idempotent request docs are useful even outside payments because they make the rule concrete: if you send the same request with the same idempotency key, the API should treat it as the same operation instead of creating new side effects.[5] Stripe also documents request IDs and replay behavior for troubleshooting, which is a good clue that idempotency and observability belong together.[6]
For agent jobs, the practical pattern is:
- generate one stable run_id
- derive a step-specific idempotency key
- store the result against that key
- return the stored result on retry instead of repeating the action
Example:
idempotency_key = run_id + ":" + step_name + ":" + operation_name
That is enough to keep a retry from creating five tickets or sending five emails.
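The pattern can be sketched as a small wrapper. The `store` here is a plain dict for illustration; in practice it must be durable storage (a database table keyed on the idempotency key), and the write should happen before the message is acknowledged.

```python
# Sketch of step-level idempotency, assuming a durable key-value `store`.
# Names (run_idempotent, action) are illustrative, not a library API.
def run_idempotent(store, run_id, step_name, operation_name, action):
    key = f"{run_id}:{step_name}:{operation_name}"
    if key in store:          # retry or duplicate delivery:
        return store[key]     # return the stored result, skip the side effect
    result = action()         # the actual side-effecting call
    store[key] = result       # persist before acking the message
    return result
```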
Checkpoints are where durability happens
Do not wait until the end of the workflow to persist state.
Persist after every irreversible step:
- context gathered
- plan drafted
- approval requested
- side effect executed
- external id captured
- final summary written
If the process dies after step 4, the next worker should resume from step 5, not rediscover the whole world.
That is also where compensation logic lives. If step 4 succeeded but step 5 failed, the workflow should know whether to retry, compensate, or stop and ask a human.
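A resume loop over those checkpoints might look like the sketch below. The step names and the `save`/`handlers` callbacks are assumptions standing in for durable storage and real step implementations; the invariant is that the checkpoint is persisted only after the step succeeds.

```python
# Hypothetical checkpointed step list, mirroring the bullets above.
STEPS = ["gather_context", "draft_plan", "request_approval",
         "execute_side_effect", "capture_external_id", "write_summary"]

def resume(job, save, handlers):
    """Run remaining steps, persisting a checkpoint after each one."""
    done = job.get("checkpoint")
    start = STEPS.index(done) + 1 if done else 0
    for step in STEPS[start:]:
        handlers[step](job)        # may raise; a retry resumes from checkpoint
        job["checkpoint"] = step
        save(job)                  # durable write before moving on
```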
Cancellation has to be real
A cancel button that only hides the spinner is a UX lie.
Cancellation should:
- flip the terminal state in the job record
- stop dequeuing new steps
- revoke or shorten future retries
- propagate to in-flight work where possible
- stop presenting progress as if the job is still active
If you are using a durable workflow engine, cancellation should become part of the workflow semantics. If you are using a message queue, you still need the job record to remember that future steps are off limits.
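A minimal version of that check, consulted by the worker before every step, might look like this. The field names are assumptions; the point is that cancellation is read from the durable job record, not from UI state.

```python
# Sketch: the worker asks the job record what to do next, so a
# cancellation changes real behavior instead of just hiding a spinner.
def next_action(job):
    if job["state"] == "cancelled":
        return "stop"                 # no new steps, no retries
    if job.get("cancel_requested"):
        job["state"] = "cancelled"    # persist this transition
        return "stop"
    return "run_next_step"
```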
Progress should come from state transitions
Long-running agent jobs lose trust when the UI goes silent.
The fix is not fake percentages. It is evented progress:
queued → running context assembly → waiting for approval → retrying tool call → writing final result → completed
Those labels are useful because they correspond to actual durable transitions. When the UI shows them, the user can tell the system is alive without assuming it is finished.
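One way to keep the UI honest is to emit the progress event from the same code path that records the durable transition, so the two can never disagree. `publish` below is a stand-in for whatever channel the UI listens on (SSE, websocket, polling table); the ordering choice is to persist first, notify second.

```python
# Sketch: progress events derived from durable state transitions.
# `save` and `publish` are illustrative callbacks, not a specific API.
def record_transition(job, new_state, save, publish):
    old = job["state"]
    job["state"] = new_state
    save(job)                                   # durable write first
    publish({"run_id": job["run_id"],
             "from": old, "to": new_state})     # then tell the UI
```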
Dead-letter queues are for diagnosis, not shame
SQS DLQs exist so you can isolate messages that do not process successfully and inspect why they failed.[3]
That is exactly how poison jobs should be handled in agent systems:
- keep the original payload
- record the last error
- record the attempt count
- record the step that failed
- record the external ids already created
If the workflow failed after creating a real side effect, the DLQ entry should make that obvious. That is the difference between “the job died” and “the job died after sending the customer an email and creating a support ticket.”
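The bullets above can be captured in one small function that builds the dead-letter entry. The field names are assumptions; the important part is carrying the external ids forward so the side effects that already happened are visible at a glance.

```python
# Sketch of a dead-letter entry with enough context to debug,
# including side effects created before the failure.
def to_dead_letter(job, payload, error):
    return {
        "payload": payload,              # original message, untouched
        "last_error": repr(error),
        "attempts": job["attempt"],
        "failed_step": job.get("step"),
        "external_ids": dict(job["external_ids"]),  # e.g. ticket/email ids
    }
```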
What to log for a long-running job
At minimum, every job log should carry:
- run_id
- step_id
- queue name
- attempt number
- visibility timeout or lease metadata
- idempotency key
- current state
- checkpoint version
- external ids created so far
- cancellation status
- failure reason if terminal
That is enough to reconstruct almost every support case I have seen.
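As one concrete shape, the list can become a context dict attached to every log line (for example as structured-logging extra fields). The keys are assumptions mapped from the bullets above, not a standard schema.

```python
# Sketch: per-job log context mirroring the minimum fields listed above.
def log_context(job):
    return {
        "run_id": job["run_id"],
        "step_id": job.get("step_id"),
        "queue": job.get("queue"),
        "attempt": job["attempt"],
        "lease_deadline": job.get("lease_deadline"),   # visibility/lease metadata
        "idempotency_key": job.get("idempotency_key"),
        "state": job["state"],
        "checkpoint_version": job.get("checkpoint_version"),
        "external_ids": job.get("external_ids", {}),
        "cancel_requested": job.get("cancel_requested", False),
        "failure_reason": job.get("failure_reason"),   # only if terminal
    }
```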
What not to do
Do not bury a long-running agent inside one request that holds a connection open until it “finishes”. That usually gives you poor cancellation, poor retries, and no durable checkpoint after the first failure.
Do not rely on the model to remember where it was after a crash. The model is not the durable state. The workflow record is.
Do not keep retrying a step that can create side effects unless the step is idempotent or wrapped in a confirmation flow.
The practical standard
If I trust a long-running agent workflow, it is because I can answer these questions from the job record:
- What state is it in right now?
- What was the last durable checkpoint?
- What side effects already happened?
- What would a retry do?
- What happens if the user cancels it now?
- If it fails again, where does it go?
If those answers are fuzzy, the queue design is still prototype-grade.
Footnotes
1. Amazon SQS standard queues: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/standard-queues.html
2. Amazon SQS visibility timeout: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html
3. Amazon SQS dead-letter queues: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html
4. Temporal docs: https://docs.temporal.io/
5. Stripe idempotent requests: https://docs.stripe.com/api/idempotent_requests
6. Stripe request IDs: https://docs.stripe.com/api/request_ids