Queue Design for Long-Running Agents
Once an agent task runs longer than a normal request cycle, the architecture changes.
You are no longer building “a chat feature”. You are building a durable workflow with a model in the middle. That distinction matters because the queue is only the delivery mechanism. The workflow is the state machine that survives retries, worker restarts, duplicate messages, and user cancellation.
Amazon SQS standard queues are a good reminder of the default failure model: they provide at-least-once delivery, messages can be delivered more than once, and they may arrive out of order.[1] Visibility timeout is only a lease on a message while a consumer is working it; if the work outlives that lease, the message can reappear.[2] Dead-letter queues exist because some jobs will fail repeatedly and need to be isolated for inspection instead of retried forever.[3]
If your agent workflow does not assume those constraints, it will eventually create duplicates, stall on poison jobs, or lose track of what already happened.
TL;DR
- Model the job as a persisted state machine.
- Assume at-least-once delivery and duplicate execution.
- Put idempotency keys on every side-effecting step.
- Persist checkpoints after each irreversible action.
- Make cancellation stop future work, not just hide it in the UI.
- Send poison jobs to a DLQ or terminal failure state with enough context to debug them.
The queue is not the workflow
The queue moves messages. It does not know whether the agent already sent the email, wrote the ticket, or charged the card.
That is why the job record has to carry the real state:
- current step
- last durable checkpoint
- attempt count
- next retry time
- external object ids
- terminal status
Temporal’s platform docs make the same point from a different angle: Temporal is built around crash-proof execution, with workflows resuming where they left off after crashes, network failures, or infrastructure outages.[4] That is the right mental model for long-running agents too. The durability lives in the workflow state, not in the transient worker process.
A state machine beats a promise chain
The simplest version that holds up in production is an explicit state machine.
| State | Meaning | What changes it |
|---|---|---|
| queued | Job exists, not yet started | Worker claims the job |
| running | Step execution in progress | Step succeeds, fails, or pauses |
| waiting_for_approval | Human gate is pending | Approval or timeout |
| retrying | A step failed and is scheduled again | Retry policy or backoff |
| cancelled | User stopped the job | Cancellation request or revocation |
| failed | Final failure with a reason | Exhausted retries or fatal error |
| dead_lettered | Message was isolated for inspection | DLQ redrive or poison-job policy |
| completed | Work finished successfully | Final output persisted |
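One way to make the table executable is an explicit transition map, so an illegal move (say, a worker marking a cancelled job as completed) fails loudly. The allowed edges below are a plausible reading of the table, not a fixed standard; adjust them to your own policy.

```python
# Hypothetical transition table for the states above. Enforcing it turns
# "which state is the job in?" into a checked invariant.
ALLOWED = {
    "queued": {"running", "cancelled"},
    "running": {"waiting_for_approval", "retrying", "cancelled",
                "failed", "completed"},
    "waiting_for_approval": {"running", "cancelled", "failed"},
    "retrying": {"running", "cancelled", "failed", "dead_lettered"},
    "cancelled": set(),          # terminal
    "failed": {"dead_lettered"},
    "dead_lettered": set(),      # terminal
    "completed": set(),          # terminal
}

def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```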
If you cannot tell which state the job is in, you cannot tell what should happen next.
Idempotency is not optional
Duplicate delivery is not a corner case. It is the default you design around.
Stripe’s idempotent request docs are useful even outside payments because they make the rule concrete: if you send the same request with the same idempotency key, the API should treat it as the same operation instead of creating new side effects.[5] Stripe also documents request IDs and replay behavior for troubleshooting, which is a good clue that idempotency and observability belong together.[6]
For agent jobs, the practical pattern is:
- generate one stable run_id
- derive a step-specific idempotency key
- store the result against that key
- return the stored result on retry instead of repeating the action
Example:
idempotency_key = run_id + ":" + step_name + ":" + operation_name
That is enough to keep a retry from creating five tickets or sending five emails.
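The pattern can be sketched as a small wrapper. The `store` here is a plain dict for illustration; in practice it must be durable storage (a database table keyed on the idempotency key), and the write should happen before the message is acknowledged.

```python
# Sketch of step-level idempotency, assuming a durable key-value `store`.
# Names (run_idempotent, action) are illustrative, not a library API.
def run_idempotent(store, run_id, step_name, operation_name, action):
    key = f"{run_id}:{step_name}:{operation_name}"
    if key in store:          # retry or duplicate delivery:
        return store[key]     # return the stored result, skip the side effect
    result = action()         # the actual side-effecting call
    store[key] = result       # persist before acking the message
    return result
```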
Checkpoints are where durability happens
Do not wait until the end of the workflow to persist state.
Persist after every irreversible step:
- context gathered
- plan drafted
- approval requested
- side effect executed
- external id captured
- final summary written
If the process dies after step 4, the next worker should resume from step 5, not rediscover the whole world.
That is also where compensation logic lives. If step 4 succeeded but step 5 failed, the workflow should know whether to retry, compensate, or stop and ask a human.
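A resume loop over those checkpoints might look like the sketch below. The step names and the `save`/`handlers` callbacks are assumptions standing in for durable storage and real step implementations; the invariant is that the checkpoint is persisted only after the step succeeds.

```python
# Hypothetical checkpointed step list, mirroring the bullets above.
STEPS = ["gather_context", "draft_plan", "request_approval",
         "execute_side_effect", "capture_external_id", "write_summary"]

def resume(job, save, handlers):
    """Run remaining steps, persisting a checkpoint after each one."""
    done = job.get("checkpoint")
    start = STEPS.index(done) + 1 if done else 0
    for step in STEPS[start:]:
        handlers[step](job)        # may raise; a retry resumes from checkpoint
        job["checkpoint"] = step
        save(job)                  # durable write before moving on
```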
Cancellation has to be real
A cancel button that only hides the spinner is a UX lie.
Cancellation should:
- flip the terminal state in the job record
- stop dequeuing new steps
- revoke or shorten future retries
- propagate to in-flight work where possible
- stop presenting progress as if the job is still active
If you are using a durable workflow engine, cancellation should become part of the workflow semantics. If you are using a message queue, you still need the job record to remember that future steps are off limits.
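A minimal version of that check, consulted by the worker before every step, might look like this. The field names are assumptions; the point is that cancellation is read from the durable job record, not from UI state.

```python
# Sketch: the worker asks the job record what to do next, so a
# cancellation changes real behavior instead of just hiding a spinner.
def next_action(job):
    if job["state"] == "cancelled":
        return "stop"                 # no new steps, no retries
    if job.get("cancel_requested"):
        job["state"] = "cancelled"    # persist this transition
        return "stop"
    return "run_next_step"
```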
Progress should come from state transitions
Long-running agent jobs lose trust when the UI goes silent.
The fix is not fake percentages. It is evented progress:
queued → running context assembly → waiting for approval → retrying tool call → writing final result → completed
Those labels are useful because they correspond to actual durable transitions. When the UI shows them, the user can tell the system is alive without assuming it is finished.
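One way to keep the UI honest is to emit the progress event from the same code path that records the durable transition, so the two can never disagree. `publish` below is a stand-in for whatever channel the UI listens on (SSE, websocket, polling table); the ordering choice is to persist first, notify second.

```python
# Sketch: progress events derived from durable state transitions.
# `save` and `publish` are illustrative callbacks, not a specific API.
def record_transition(job, new_state, save, publish):
    old = job["state"]
    job["state"] = new_state
    save(job)                                   # durable write first
    publish({"run_id": job["run_id"],
             "from": old, "to": new_state})     # then tell the UI
```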
Dead-letter queues are for diagnosis, not shame
SQS DLQs exist so you can isolate messages that do not process successfully and inspect why they failed.[3]
That is exactly how poison jobs should be handled in agent systems:
- keep the original payload
- record the last error
- record the attempt count
- record the step that failed
- record the external ids already created
If the workflow failed after creating a real side effect, the DLQ entry should make that obvious. That is the difference between “the job died” and “the job died after sending the customer an email and creating a support ticket.”
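The bullets above can be captured in one small function that builds the dead-letter entry. The field names are assumptions; the important part is carrying the external ids forward so the side effects that already happened are visible at a glance.

```python
# Sketch of a dead-letter entry with enough context to debug,
# including side effects created before the failure.
def to_dead_letter(job, payload, error):
    return {
        "payload": payload,              # original message, untouched
        "last_error": repr(error),
        "attempts": job["attempt"],
        "failed_step": job.get("step"),
        "external_ids": dict(job["external_ids"]),  # e.g. ticket/email ids
    }
```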
What to log for a long-running job
At minimum, every job log should carry:
- run_id
- step_id
- queue name
- attempt number
- visibility timeout or lease metadata
- idempotency key
- current state
- checkpoint version
- external ids created so far
- cancellation status
- failure reason if terminal
That is enough to reconstruct almost every support case I have seen.
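As one concrete shape, the list can become a context dict attached to every log line (for example as structured-logging extra fields). The keys are assumptions mapped from the bullets above, not a standard schema.

```python
# Sketch: per-job log context mirroring the minimum fields listed above.
def log_context(job):
    return {
        "run_id": job["run_id"],
        "step_id": job.get("step_id"),
        "queue": job.get("queue"),
        "attempt": job["attempt"],
        "lease_deadline": job.get("lease_deadline"),   # visibility/lease metadata
        "idempotency_key": job.get("idempotency_key"),
        "state": job["state"],
        "checkpoint_version": job.get("checkpoint_version"),
        "external_ids": job.get("external_ids", {}),
        "cancel_requested": job.get("cancel_requested", False),
        "failure_reason": job.get("failure_reason"),   # only if terminal
    }
```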
What not to do
Do not bury a long-running agent inside one request that holds a connection open until it “finishes”. That usually gives you poor cancellation, poor retries, and no durable checkpoint after the first failure.
Do not rely on the model to remember where it was after a crash. The model is not the durable state. The workflow record is.
Do not keep retrying a step that can create side effects unless the step is idempotent or wrapped in a confirmation flow.
The practical standard
If I trust a long-running agent workflow, it is because I can answer these questions from the job record:
- What state is it in right now?
- What was the last durable checkpoint?
- What side effects already happened?
- What would a retry do?
- What happens if the user cancels it now?
- If it fails again, where does it go?
If those answers are fuzzy, the queue design is still prototype-grade.
Footnotes
1. Amazon SQS standard queues: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/standard-queues.html
2. Amazon SQS visibility timeout: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html
3. Amazon SQS dead-letter queues: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html
4. Temporal docs: https://docs.temporal.io/
5. Stripe idempotent requests: https://docs.stripe.com/api/idempotent_requests
6. Stripe request IDs: https://docs.stripe.com/api/request_ids