Failure Recovery

AI runtime recovery is not “retry the model.” It is a state-aware decision about what completed, what changed externally, what remains authorized, and whether the task can safely continue.

Key takeaways

Classify failures by phase, side effects, retryability, and evidence requirements.
Use idempotency for duplicate delivery and compensation for already-committed effects.
Recovery must remain within the original authority, risk, budget, and deadline unless a new approval changes them.

[Diagram: failure recovery state machine]

Recovery principles

Detect the failure at the lowest layer that has enough context to classify it.
Normalize it into a stable error taxonomy for callers.
Determine whether execution started and whether any side effect committed.
Check deadline, retry budget, authorization, idempotency, and dependency health.
Choose retry, route fallback, checkpoint restore, compensation, approval, escalation, or termination.
Persist the decision and observed outcome.
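
The steps above can be sketched as a single decision function. This is a minimal illustration, not a reference implementation: the names FailureContext, Action, choose_action, and decide_and_record are hypothetical, and the order of the checks is one plausible policy among several.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    RETRY = "retry"
    QUERY_STATUS = "query_status"
    COMPENSATE = "compensate"
    REQUEST_APPROVAL = "request_approval"
    ESCALATE = "escalate"
    TERMINATE = "terminate"


@dataclass(frozen=True)
class FailureContext:
    """Hypothetical snapshot of what is known when a failure is handled."""
    started: bool                     # did execution begin?
    effect_committed: Optional[bool]  # None means the outcome is unknown
    idempotent: bool
    attempts_left: int
    seconds_to_deadline: float
    authorized: bool
    dependency_healthy: bool


def choose_action(ctx: FailureContext) -> Action:
    # Budget and deadline bound every other option.
    if ctx.seconds_to_deadline <= 0 or ctx.attempts_left <= 0:
        return Action.TERMINATE
    # Recovery may not exceed the original authority.
    if not ctx.authorized:
        return Action.REQUEST_APPROVAL
    # A started, non-idempotent step needs its external state resolved first.
    if ctx.started and not ctx.idempotent:
        if ctx.effect_committed is None:
            return Action.QUERY_STATUS
        if ctx.effect_committed:
            return Action.COMPENSATE
    if not ctx.dependency_healthy:
        return Action.ESCALATE
    return Action.RETRY


def decide_and_record(ctx: FailureContext, log: list) -> Action:
    """Choose an action and persist the decision (here, to an in-memory list)."""
    action = choose_action(ctx)
    log.append({"context": ctx, "action": action.value})
    return action
```

In a real runtime the log would be a durable store, and route fallback and checkpoint restore would be additional actions; they are left out here to keep the sketch short.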

Failure state machine

A task can move from running to awaiting approval, recovering, completed, failed, or terminated. Recovery should not be a hidden loop. It is a visible state with attempt count, selected checkpoint, policy decision, and expiry. Human review may authorize a new action but should not rewrite the history of prior attempts.
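
One way to keep recovery visible is an explicit transition table with an append-only history. The sketch below is an assumption-laden illustration: only the transitions out of the running state come from the text above; the transitions out of awaiting approval and recovering, and the names TaskState, ALLOWED, and Task, are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskState(Enum):
    RUNNING = "running"
    AWAITING_APPROVAL = "awaiting_approval"
    RECOVERING = "recovering"
    COMPLETED = "completed"
    FAILED = "failed"
    TERMINATED = "terminated"


S = TaskState
ALLOWED = {
    # From the text: running can move to any of the other five states.
    S.RUNNING: {S.AWAITING_APPROVAL, S.RECOVERING, S.COMPLETED, S.FAILED, S.TERMINATED},
    # Assumed for illustration: these outgoing edges are not stated in the text.
    S.AWAITING_APPROVAL: {S.RUNNING, S.RECOVERING, S.TERMINATED},
    S.RECOVERING: {S.RUNNING, S.AWAITING_APPROVAL, S.FAILED, S.TERMINATED},
    # Terminal states have no outgoing edges.
    S.COMPLETED: set(),
    S.FAILED: set(),
    S.TERMINATED: set(),
}


@dataclass
class Task:
    state: TaskState = TaskState.RUNNING
    attempt: int = 1
    history: list = field(default_factory=list)  # append-only by convention

    def transition(self, new_state: TaskState, reason: str) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.value} -> {new_state.value}")
        # Record the move before applying it; prior entries are never edited.
        self.history.append((self.state, new_state, self.attempt, reason))
        if new_state is TaskState.RECOVERING:
            self.attempt += 1  # each recovery counts as a new attempt
        self.state = new_state
```

A fuller version would also carry the selected checkpoint, the policy decision, and an expiry on the recovering state.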

Failure catalog

Runtime failure classes
Class	Examples	Default response
Admission	Invalid contract, quota, unsupported version, expired deadline	Reject without starting execution
Model	Timeout, overload, refusal, malformed output	Retry or fallback only within route policy and budget
Context	Unavailable source, stale index, classification conflict	Stop or return qualified incomplete result
Tool	Timeout, invalid response, permission denial, partial commit	Query status, compensate, or escalate; avoid blind retry
Policy	Denied action, expired approval, policy service unavailable	Fail closed for privileged effects
State	Checkpoint write failure, memory conflict, duplicate delivery	Use idempotency and authoritative state reconciliation
Infrastructure	Worker loss, device error, network partition, storage outage	Reschedule when safe and preserve attempt identity
Evidence	Trace gap, artifact hash failure, retention write failure	Mark degraded or fail closed according to evidence policy
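
The catalog can be carried in code as a small, stable taxonomy so that callers see the same class and default response regardless of which layer raised the failure. This is only a sketch: FailureClass, DEFAULT_RESPONSE, NormalizedFailure, normalize, and the dotted code format are hypothetical names and conventions, while the class names and default responses are copied from the table.

```python
from dataclasses import dataclass
from enum import Enum


class FailureClass(Enum):
    ADMISSION = "admission"
    MODEL = "model"
    CONTEXT = "context"
    TOOL = "tool"
    POLICY = "policy"
    STATE = "state"
    INFRASTRUCTURE = "infrastructure"
    EVIDENCE = "evidence"


# Default responses, taken verbatim from the failure catalog table.
DEFAULT_RESPONSE = {
    FailureClass.ADMISSION: "Reject without starting execution",
    FailureClass.MODEL: "Retry or fallback only within route policy and budget",
    FailureClass.CONTEXT: "Stop or return qualified incomplete result",
    FailureClass.TOOL: "Query status, compensate, or escalate; avoid blind retry",
    FailureClass.POLICY: "Fail closed for privileged effects",
    FailureClass.STATE: "Use idempotency and authoritative state reconciliation",
    FailureClass.INFRASTRUCTURE: "Reschedule when safe and preserve attempt identity",
    FailureClass.EVIDENCE: "Mark degraded or fail closed according to evidence policy",
}


@dataclass(frozen=True)
class NormalizedFailure:
    failure_class: FailureClass
    code: str               # stable identifier, e.g. "tool.partial_commit"
    default_response: str


def normalize(failure_class: FailureClass, detail: str) -> NormalizedFailure:
    """Build a stable code from the class and a short detail label."""
    slug = detail.strip().lower().replace(" ", "_")
    return NormalizedFailure(
        failure_class=failure_class,
        code=f"{failure_class.value}.{slug}",
        default_response=DEFAULT_RESPONSE[failure_class],
    )
```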

Retry eligibility

A retry is eligible only when the operation is read-only or idempotent, or when its prior state can be authoritatively determined. Use exponential backoff and jitter for transient dependency failures, but keep the retry budget bounded by the request deadline. Each model retry should record whether the prompt, model route, decoding constraints, or context changed; otherwise evaluation cannot distinguish repeated sampling from a real recovery strategy.
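
A deadline-bounded retry loop with exponential backoff and full jitter might look like the following. It is a sketch under stated assumptions: retry_with_deadline and its parameters are hypothetical, TimeoutError stands in for whatever the runtime classifies as transient, and the caller is assumed to have already established that op is read-only or idempotent.

```python
import random
import time


def retry_with_deadline(op, *, deadline, max_attempts=4, base=0.2, cap=5.0,
                        sleep=time.sleep, now=time.monotonic, rng=random.random):
    """Call op() until it succeeds, attempts run out, or the deadline is near.

    deadline is an absolute time on the same clock as now().
    sleep, now, and rng are injectable so the loop can be tested without waiting.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        if now() >= deadline:
            raise TimeoutError("request deadline reached before attempt")
        try:
            return op()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted
            # Full jitter: a random delay up to the capped exponential bound.
            delay = rng() * min(cap, base * 2 ** attempt)
            if now() + delay >= deadline:
                raise  # waiting would overrun the request deadline
            sleep(delay)
```

For example, `retry_with_deadline(fetch, deadline=time.monotonic() + 3.0)` gives a hypothetical `fetch` at most four attempts within roughly three seconds.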

Side effects and compensation

Reversible writes should expose an operation identifier and compensation action. Irreversible or high-impact tools should require stronger admission and approval because rollback may be impossible. Compensation is not the same as transaction rollback: sending a second communication does not erase the first; issuing a refund does not undo disclosure. Evidence should represent both actions.
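
The distinction between deduplication and compensation can be illustrated with a small ledger keyed by operation identifier. This is a hypothetical in-memory stand-in for a durable operation log; EffectLedger, apply, and compensate are illustrative names, not an existing API.

```python
from dataclasses import dataclass, field


@dataclass
class EffectLedger:
    """In-memory stand-in for a durable operation log (hypothetical)."""
    results: dict = field(default_factory=dict)
    compensated: set = field(default_factory=set)
    events: list = field(default_factory=list)  # evidence of every action taken

    def apply(self, operation_id, do):
        # Duplicate delivery: return the stored result, do not repeat the effect.
        if operation_id in self.results:
            return self.results[operation_id]
        result = do()
        self.results[operation_id] = result
        self.events.append(("applied", operation_id))
        return result

    def compensate(self, operation_id, undo):
        if operation_id not in self.results:
            raise KeyError(f"no committed effect for {operation_id}")
        if operation_id in self.compensated:
            return  # compensation is itself deduplicated
        undo(self.results[operation_id])
        self.compensated.add(operation_id)
        # The original "applied" event stays; compensation is a second action.
        self.events.append(("compensated", operation_id))
```

Note that compensate adds an event rather than deleting the earlier one, so the evidence shows both the original effect and the offsetting action.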

Recovery evidence

Failure category, code, source layer, and UTC timestamp
Observed external state and completed steps
Checkpoint or prior result selected
Retry eligibility and policy reason
Changed model route, context, tool version, or parameters
Compensation or rollback action and result
Human decision and approval expiry
Final outcome and unresolved uncertainty
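
The fields above map naturally onto a single immutable record. The sketch below assumes a hypothetical RecoveryEvidence dataclass serialized as JSON; the field names and types are illustrative, and a real schema would be set by the runtime's evidence policy.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class RecoveryEvidence:
    category: str
    code: str
    source_layer: str
    observed_state: str
    completed_steps: tuple
    checkpoint: Optional[str]       # checkpoint or prior result selected
    retry_eligible: bool
    policy_reason: str
    changes: dict                   # changed route, context, tool version, parameters
    compensation: Optional[str]     # compensation or rollback action and result
    human_decision: Optional[str]
    approval_expiry: Optional[str]
    final_outcome: str
    unresolved: str                 # remaining uncertainty, stated explicitly
    timestamp_utc: str = field(default_factory=_utc_now)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)
```

Freezing the record discourages later edits; a correction would be written as a new record rather than a mutation of the old one.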

Failure testing

Inject timeouts, duplicate deliveries, worker loss, invalid structured output, denied permissions, partial tool commits, expired approvals, checkpoint-store outages, and evidence-write failures. Tests should assert not only that the task returns an error, but that unauthorized work did not occur, completed effects were not duplicated, recovery stayed within budget, and evidence is sufficient to reconstruct the decision.
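
As an illustration of one injected fault, the sketch below simulates a partial tool commit (the write lands, then the response is lost) and asserts on effects and evidence rather than only on the returned error. FlakyTool and run_with_recovery are hypothetical test doubles, not part of any real runtime.

```python
class FlakyTool:
    """Hypothetical tool: commits the write, then loses the first response."""

    def __init__(self):
        self.effects = []      # every committed effect, duplicates included
        self.fail_next = True

    def write(self, operation_id):
        self.effects.append(operation_id)  # not idempotent: a blind retry duplicates
        if self.fail_next:
            self.fail_next = False
            raise TimeoutError("response lost after commit")
        return operation_id

    def status(self, operation_id):
        return operation_id in self.effects


def run_with_recovery(tool, operation_id, max_attempts=2):
    """Write once; on timeout, query status before deciding whether to retry."""
    evidence = []
    for attempt in range(1, max_attempts + 1):
        try:
            tool.write(operation_id)
            evidence.append((attempt, "committed"))
            return evidence
        except TimeoutError:
            if tool.status(operation_id):
                evidence.append((attempt, "timeout; status query shows committed"))
                return evidence
            evidence.append((attempt, "timeout; not committed"))
    evidence.append((max_attempts, "retry budget exhausted"))
    return evidence


def test_partial_commit_is_not_duplicated():
    tool = FlakyTool()
    evidence = run_with_recovery(tool, "op-42")
    assert tool.effects == ["op-42"]           # the effect happened exactly once
    assert len(evidence) == 1                  # recovery stayed within budget
    assert "status query" in evidence[0][1]    # the decision can be reconstructed
```

A blind retry in place of the status query would leave two entries in tool.effects, which is exactly the regression this kind of test is meant to catch.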
