Durable agent execution lets a long-running AI task survive process loss, deployments, timeouts, and waits for human input without repeating completed or irreversible work.
Key takeaways
- Persist task state independently from disposable model/tool workers.
- Replay orchestration decisions from recorded results; do not assume model calls are deterministic.
- Use idempotency and compensation at every external side-effect boundary.
Definition
A durable runtime assigns a stable task identity, records state transitions, and can reconstruct what should happen next. It distinguishes accepted, started, completed, failed, compensated, and awaiting-approval states.
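As a minimal sketch, the states named above can be modeled as an enum with an explicit transition table. The table itself is an assumption made for illustration; a real runtime would define its own legal transitions.

```python
from enum import Enum


class TaskState(Enum):
    ACCEPTED = "accepted"
    STARTED = "started"
    AWAITING_APPROVAL = "awaiting_approval"
    COMPLETED = "completed"
    FAILED = "failed"
    COMPENSATED = "compensated"


# Assumed transition table (illustrative only).
ALLOWED = {
    TaskState.ACCEPTED: {TaskState.STARTED, TaskState.FAILED},
    TaskState.STARTED: {
        TaskState.AWAITING_APPROVAL,
        TaskState.COMPLETED,
        TaskState.FAILED,
    },
    TaskState.AWAITING_APPROVAL: {TaskState.STARTED, TaskState.FAILED},
    TaskState.FAILED: {TaskState.COMPENSATED},
    TaskState.COMPLETED: set(),
    TaskState.COMPENSATED: set(),
}


def apply_transition(history, new_state):
    """Return a new history with new_state appended, if the transition is legal."""
    current = history[-1]
    if new_state not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {new_state.value}")
    return history + [new_state]


def next_options(history):
    """Reconstruct what may happen next from the recorded transitions alone."""
    return ALLOWED[history[-1]]
```

Because the history is the only input to next_options, any worker that can read the recorded transitions can pick the task up.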
Durable state
Persist the request contract, current phase, completed steps, stable outputs or artifact references, idempotency keys, approval state, deadlines, budgets, and evidence correlation identifiers. Keep large or sensitive payloads in protected stores and reference them rather than duplicating them into workflow history.
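One possible shape for such a record is sketched below. All field names are hypothetical, and JSON is used only as a stand-in for whatever serialization the runtime's store expects. Note that artifact_refs holds references, not payloads.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class TaskRecord:
    task_id: str
    request_contract: dict
    phase: str
    completed_steps: list = field(default_factory=list)
    # step name -> reference into a protected store (never the payload itself)
    artifact_refs: dict = field(default_factory=dict)
    # step name -> stable operation key used at the side-effect boundary
    idempotency_keys: dict = field(default_factory=dict)
    approval_state: str = "none"
    deadline_epoch_s: Optional[float] = None
    budget_remaining: dict = field(default_factory=dict)
    correlation_id: str = ""

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "TaskRecord":
        return cls(**json.loads(raw))
```

A record like this round-trips through the store unchanged, so a fresh worker can rebuild the task from the persisted form alone.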
Checkpoints and activities
A checkpoint marks a recoverable boundary. Activities such as model invocation, sandbox execution, and external tools run outside deterministic orchestration and return recorded results. Checkpoint frequency balances recovery granularity against storage and latency.
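The following sketch illustrates the checkpoint idea with an in-memory dict standing in for a durable checkpoint store (an assumption; a real store would be persistent and transactional). Each activity's result is recorded at a boundary, and a restarted run skips any step whose result is already recorded.

```python
def run_with_checkpoints(steps, checkpoint):
    """Run (name, activity) pairs in order, recording each result.

    steps: list of (name, zero-argument callable) pairs.
    checkpoint: dict-like store of name -> recorded result.
    """
    for name, activity in steps:
        if name in checkpoint:
            # Completed before the last crash: reuse the recorded result.
            continue
        # The activity runs outside the orchestration logic; only its
        # recorded result is visible to later steps and to later runs.
        checkpoint[name] = activity()
    return checkpoint
```

Here the checkpoint is taken after every activity, which gives the finest recovery granularity at the cost of one store write per step.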
Idempotency
Every write-oriented tool call carries a stable operation key, and the tool exposes an authoritative status lookup for that key. Duplicate delivery returns the prior result or a conflict. For operations that cannot be made idempotent, require stronger approval and a compensation plan.
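A minimal sketch of this pattern follows. IdempotentTool and ConflictError are hypothetical names, and the in-memory dict stands in for a durable record; a real implementation would also need the record and the side effect to commit atomically, or would rely on the downstream system to enforce the key.

```python
class ConflictError(Exception):
    """Raised when a key is reused with a different request."""


class IdempotentTool:
    def __init__(self, operation):
        self._operation = operation  # the real side-effecting call
        self._records = {}           # operation key -> {"request", "result"}

    def call(self, key, request):
        record = self._records.get(key)
        if record is not None:
            if record["request"] != request:
                raise ConflictError(f"key {key!r} already used for another request")
            # Duplicate delivery: return the prior result, do not repeat the effect.
            return record["result"]
        result = self._operation(request)
        self._records[key] = {"request": request, "result": result}
        return result

    def status(self, key):
        """Authoritative lookup: the stored record, or None if nothing committed."""
        return self._records.get(key)
```

Calling the tool twice with the same key and request performs the underlying operation once; reusing the key with a different request surfaces a conflict instead of silently overwriting.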
Long waits and approvals
Durable timers and signals allow a task to release compute while waiting for a person, webhook, batch job, or external system. On resume, the runtime validates that the approval has not expired, that the resource version is unchanged, and that the proposed action is still applicable.
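The resume checks might look like the sketch below. The Approval fields and the check_resume function are hypothetical, and sending every failed check back to review is an assumption made for illustration. The current time is passed in explicitly so the check stays deterministic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approval:
    action_id: str
    resource_version: int  # version of the resource the approver saw
    expires_at: float      # epoch seconds


def check_resume(approval, now, current_resource_version, action_applicable):
    """Return ("execute", None) or ("return_to_review", reason)."""
    if now >= approval.expires_at:
        return ("return_to_review", "approval expired")
    if current_resource_version != approval.resource_version:
        return ("return_to_review", "resource changed during wait")
    if not action_applicable:
        return ("return_to_review", "action no longer applicable")
    return ("execute", None)
```

The checks run at resume time rather than at approval time because the world may have changed while the task was parked.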
Replay semantics
Workflow replay reconstructs control flow from recorded events. Model and tool calls return their recorded results during replay unless they are intentionally re-executed under a new attempt. Persist code and schema versions alongside the history so the runtime can detect incompatible changes.
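A toy illustration of replay, assuming a history shaped as a dict with a version string and an ordered event list (both hypothetical). During replay the recorded result is returned and the model is not called again; a mismatch between the code's request and the recorded event is treated as an error.

```python
class NonDeterminismError(Exception):
    """The code asked for a different activity than the history recorded."""


class ReplayContext:
    def __init__(self, history, code_version):
        # history: {"version": str, "events": [{"name": ..., "result": ...}]}
        if history["events"] and history["version"] != code_version:
            raise RuntimeError("history was recorded by an incompatible code version")
        history["version"] = code_version
        self._events = history["events"]
        self._cursor = 0

    def activity(self, name, fn):
        if self._cursor < len(self._events):
            event = self._events[self._cursor]
            if event["name"] != name:
                raise NonDeterminismError(f"expected {event['name']!r}, got {name!r}")
            self._cursor += 1
            return event["result"]  # replay: reuse the recorded result
        result = fn()               # live: execute once and record
        self._events.append({"name": name, "result": result})
        self._cursor += 1
        return result


def workflow(ctx, model):
    """Control flow depends only on recorded results, never on a fresh model call."""
    draft = ctx.activity("draft", lambda: model("draft"))
    if "risky" in draft:
        return ctx.activity("review", lambda: model("review"))
    return draft
```

Running workflow a second time against the same history takes the same branch and returns the same value even if the model would now answer differently.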
Failure model
- Worker lost before activity starts: reschedule.
- Worker lost after side effect but before acknowledgement: query the idempotency record before retrying.
- Checkpoint unavailable: fail closed for state-changing tasks.
- Approval expired: return to review rather than execute.
- Code version incompatible with history: migrate explicitly or retain a compatible worker.
- Cancellation: propagate to active workers and record whether effects committed.
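The two worker-loss cases above can be sketched as a small decision function. The function name and the returned action labels are hypothetical; idempotency_record is assumed to be whatever the authoritative status lookup returned for the activity's operation key.

```python
def recover_lost_worker(activity_started, idempotency_record):
    """Decide what to do with an activity whose worker disappeared.

    idempotency_record: the stored record for the operation key,
    or None if no effect was recorded under that key.
    """
    if not activity_started:
        # Nothing ran, so nothing can be duplicated.
        return ("reschedule", None)
    if idempotency_record is None:
        # Started, but no effect committed under this key: retry with the same key.
        return ("reschedule_same_key", None)
    # The effect committed before the acknowledgement was lost:
    # adopt the recorded result instead of repeating the effect.
    return ("adopt_recorded_result", idempotency_record["result"])
```

This covers only worker loss; the other rows of the failure model would need their own handling.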
Testing
Kill workers at every boundary, deliver duplicate events, delay approvals, change external state during waits, deploy incompatible code, and simulate checkpoint-store outage. Assert that effects are not duplicated and the task produces an accurate evidence trail.
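A fault-injection test along these lines can be sketched with toy components. Everything here is hypothetical: EffectSink stands in for an idempotent external system, a set stands in for the checkpoint store, and each step has two kill points, one before its side effect and one between the side effect and the checkpoint.

```python
class WorkerKilled(Exception):
    """Simulated process loss."""


class EffectSink:
    """Toy external system that deduplicates on the operation key."""

    def __init__(self):
        self.effects = []  # every side effect that was really performed
        self._seen = set()

    def apply(self, key):
        if key in self._seen:
            return  # duplicate delivery: no second effect
        self._seen.add(key)
        self.effects.append(key)


def run_task(steps, checkpoint, sink, kill_at=None):
    boundary = 0
    for step in steps:
        if step in checkpoint:
            continue
        if boundary == kill_at:
            raise WorkerKilled(f"before {step}")
        boundary += 1
        sink.apply(step)  # side effect, keyed by the step name
        if boundary == kill_at:
            raise WorkerKilled(f"after effect of {step}, before checkpoint")
        boundary += 1
        checkpoint.add(step)


def kill_at_every_boundary(steps):
    """Crash once at each boundary, recover, and collect the resulting effects."""
    outcomes = []
    for kill_at in range(2 * len(steps)):
        checkpoint, sink = set(), EffectSink()
        try:
            run_task(steps, checkpoint, sink, kill_at)
        except WorkerKilled:
            pass
        run_task(steps, checkpoint, sink)  # recovery run
        outcomes.append((sink.effects, checkpoint))
    return outcomes
```

For every kill point, the expectation is that each effect appears exactly once and every step is checkpointed after recovery.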
