AI runtime recovery is not “retry the model.” It is a state-aware decision about what completed, what changed externally, what remains authorized, and whether the task can safely continue.
Key takeaways
- Classify failures by phase, side effects, retryability, and evidence requirements.
- Use idempotency for duplicate delivery and compensation for already-committed effects.
- Recovery must remain within the original authority, risk, budget, and deadline unless a new approval changes them.
[ar_diagram id="failure-recovery-state-machine"]
Recovery principles
- Detect the failure at the lowest layer that has enough context to classify it.
- Normalize it into a stable error taxonomy for callers.
- Determine whether execution started and whether any side effect committed.
- Check deadline, retry budget, authorization, idempotency, and dependency health.
- Choose retry, route fallback, checkpoint restore, compensation, approval, escalation, or termination.
- Persist the decision and observed outcome.
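
The steps above can be sketched as a single decision function. This is a minimal illustration, not a reference implementation: the names `FailureContext`, `Action`, and `choose_action`, the field set, and the order of the checks are all assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    RETRY = "retry"
    QUERY_STATUS = "query_status"
    COMPENSATE = "compensate"
    ESCALATE = "escalate"
    TERMINATE = "terminate"


@dataclass(frozen=True)
class FailureContext:
    """Hypothetical summary of what is known when a failure is handled."""
    started: bool                      # did execution begin?
    effect_committed: Optional[bool]   # None means the outcome is unknown
    idempotent: bool
    seconds_left: float                # time remaining before the deadline
    retries_left: int
    authorized: bool                   # original authority still valid?
    compensable: bool = False


def choose_action(ctx: FailureContext) -> Action:
    # Deadline and authorization bound every other option.
    if ctx.seconds_left <= 0 or not ctx.authorized:
        return Action.TERMINATE
    # Unknown external state: find out before doing anything else.
    if ctx.effect_committed is None:
        return Action.QUERY_STATUS
    # A committed effect cannot be retried away; compensate or hand off.
    if ctx.effect_committed:
        return Action.COMPENSATE if ctx.compensable else Action.ESCALATE
    # Nothing committed: retry only if it is safe and budget remains.
    if ctx.retries_left > 0 and (not ctx.started or ctx.idempotent):
        return Action.RETRY
    return Action.ESCALATE
```

A real runtime would also persist the chosen action and the observed outcome, as the last principle requires.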
Failure state machine
A task can move from running to awaiting approval, recovering, completed, failed, or terminated. Recovery should not be a hidden loop. It should be a visible state that records the attempt count, selected checkpoint, policy decision, and expiry. Human review may authorize a new action but should not rewrite the history of prior attempts.
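
One way to make recovery a visible state is an explicit transition table with an append-only history. The sketch below is illustrative: the transitions out of `RUNNING` come from the text, while the transitions out of `AWAITING_APPROVAL` and `RECOVERING`, and the `Task` class itself, are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskState(Enum):
    RUNNING = "running"
    AWAITING_APPROVAL = "awaiting_approval"
    RECOVERING = "recovering"
    COMPLETED = "completed"
    FAILED = "failed"
    TERMINATED = "terminated"


# Terminal states (completed, failed, terminated) have no outgoing edges.
ALLOWED = {
    TaskState.RUNNING: {
        TaskState.AWAITING_APPROVAL, TaskState.RECOVERING,
        TaskState.COMPLETED, TaskState.FAILED, TaskState.TERMINATED,
    },
    # Assumed edges: approval and recovery can resume or stop the task.
    TaskState.AWAITING_APPROVAL: {
        TaskState.RUNNING, TaskState.RECOVERING, TaskState.TERMINATED,
    },
    TaskState.RECOVERING: {
        TaskState.RUNNING, TaskState.AWAITING_APPROVAL,
        TaskState.FAILED, TaskState.TERMINATED,
    },
}


@dataclass
class Task:
    state: TaskState = TaskState.RUNNING
    attempt: int = 1
    history: list = field(default_factory=list)  # only ever appended to

    def transition(self, new_state: TaskState, reason: str) -> None:
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(f"{self.state.value} -> {new_state.value} not allowed")
        # Record the prior attempt before changing anything.
        self.history.append((self.state, new_state, self.attempt, reason))
        if new_state is TaskState.RECOVERING:
            self.attempt += 1  # each recovery is a new, countable attempt
        self.state = new_state
```

Because every transition appends to `history`, a later human approval adds a new entry rather than altering earlier ones.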
Failure catalog
| Class | Examples | Default response |
|---|---|---|
| Admission | Invalid contract, quota, unsupported version, expired deadline | Reject without starting execution |
| Model | Timeout, overload, refusal, malformed output | Retry or fallback only within route policy and budget |
| Context | Unavailable source, stale index, classification conflict | Stop or return a qualified, incomplete result |
| Tool | Timeout, invalid response, permission denial, partial commit | Query status, compensate, or escalate; avoid blind retry |
| Policy | Denied action, expired approval, policy service unavailable | Fail closed for privileged effects |
| State | Checkpoint write failure, memory conflict, duplicate delivery | Use idempotency and authoritative state reconciliation |
| Infrastructure | Worker loss, device error, network partition, storage outage | Reschedule when safe and preserve attempt identity |
| Evidence | Trace gap, artifact hash failure, retention write failure | Mark degraded or fail closed according to evidence policy |
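
The catalog can be encoded as a small, stable taxonomy that callers depend on instead of raw exceptions. The following is a sketch under stated assumptions: the response labels paraphrase the table, and `normalize`, `NormalizedFailure`, and the `layer.ExceptionName` code format are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class FailureClass(Enum):
    ADMISSION = "admission"
    MODEL = "model"
    CONTEXT = "context"
    TOOL = "tool"
    POLICY = "policy"
    STATE = "state"
    INFRASTRUCTURE = "infrastructure"
    EVIDENCE = "evidence"


# Default responses, paraphrased from the catalog table.
DEFAULT_RESPONSE = {
    FailureClass.ADMISSION: "reject_before_execution",
    FailureClass.MODEL: "retry_or_fallback_within_budget",
    FailureClass.CONTEXT: "stop_or_return_qualified_result",
    FailureClass.TOOL: "query_status_compensate_or_escalate",
    FailureClass.POLICY: "fail_closed",
    FailureClass.STATE: "reconcile_with_authoritative_state",
    FailureClass.INFRASTRUCTURE: "reschedule_when_safe",
    FailureClass.EVIDENCE: "degrade_or_fail_closed",
}


@dataclass(frozen=True)
class NormalizedFailure:
    failure_class: FailureClass
    code: str
    default_response: str


def normalize(layer: str, exc: Exception) -> NormalizedFailure:
    """Map a raw exception raised in `layer` onto the stable taxonomy."""
    failure_class = FailureClass(layer)  # raises ValueError for unknown layers
    code = f"{layer}.{type(exc).__name__}"
    return NormalizedFailure(failure_class, code, DEFAULT_RESPONSE[failure_class])
```

For example, `normalize("tool", TimeoutError())` yields a failure whose default response is to query status rather than retry blindly.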
Retry eligibility
A retry is eligible only when the operation is read-only, idempotent, or its prior state can be authoritatively determined. Use exponential backoff and jitter for transient dependencies, but keep the retry budget bounded by the request deadline. Model retries should specify whether prompt, model route, decoding constraints, or context changed; otherwise evaluation cannot distinguish repeated sampling from a real recovery strategy.
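
A deadline-bounded backoff can be sketched as a generator of sleep intervals. This example uses the "full jitter" variant (a uniform random delay up to an exponentially growing ceiling), which is one common choice and not something the text prescribes. The function name and parameters are assumptions.

```python
import random
import time


def backoff_delays(base, cap, max_attempts, deadline,
                   now=time.monotonic, rng=random.random):
    """Yield jittered delays until attempts run out or the deadline is near.

    `deadline` is an absolute time on the same clock as `now()`.
    `now` and `rng` are injectable so tests can be deterministic.
    """
    for attempt in range(max_attempts):
        ceiling = min(cap, base * 2 ** attempt)  # exponential growth, capped
        delay = rng() * ceiling                  # full jitter in [0, ceiling)
        if now() + delay >= deadline:
            return  # sleeping would cross the deadline: stop retrying
        yield delay
```

A caller would sleep for each yielded delay and then retry. When the generator is exhausted, the retry budget or the deadline has run out and the failure should move to another recovery path.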
Side effects and compensation
Reversible writes should expose an operation identifier and a compensation action. Irreversible or high-impact tools should require stronger admission and approval because rollback may be impossible. Compensation is not the same as transaction rollback: sending a second communication does not erase the first; issuing a refund does not undo disclosure. Evidence should record both the original action and the compensating one.
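
The difference between suppressing a duplicate and compensating a committed effect can be shown with an in-memory ledger. This is a minimal sketch: `EffectLedger` and its methods are hypothetical, and a production system would need durable storage and concurrency control.

```python
from dataclasses import dataclass, field


@dataclass
class EffectLedger:
    results: dict = field(default_factory=dict)  # operation_id -> result
    events: list = field(default_factory=list)   # append-only evidence

    def execute(self, operation_id, action):
        """Run `action` at most once per operation identifier."""
        if operation_id in self.results:
            # Duplicate delivery: return the recorded result, run nothing.
            self.events.append(("duplicate_suppressed", operation_id))
            return self.results[operation_id]
        result = action()
        self.results[operation_id] = result
        self.events.append(("committed", operation_id))
        return result

    def compensate(self, operation_id, compensation):
        """Apply a compensating action; the original event is kept."""
        if operation_id not in self.results:
            raise KeyError(operation_id)
        outcome = compensation()
        self.events.append(("compensated", operation_id))
        return outcome
```

After a compensation, `events` contains both the `committed` and the `compensated` entries, so the evidence shows that the first effect happened and was later offset, not erased.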
Recovery evidence
- Failure category, code, source layer, and UTC timestamp
- Observed external state and completed steps
- Checkpoint or prior result selected
- Retry eligibility and policy reason
- Changed model route, context, tool version, or parameters
- Compensation or rollback action and result
- Human decision and approval expiry
- Final outcome and unresolved uncertainty
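
The fields listed above might be captured in a single immutable record. The sketch below is one possible shape: the class name, field names, and JSON serialization are assumptions for illustration, not a defined schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class RecoveryEvidence:
    category: str                       # failure category, e.g. "tool"
    code: str
    source_layer: str
    observed_state: str                 # what the external system reported
    completed_steps: tuple
    checkpoint_id: Optional[str]        # checkpoint or prior result selected
    retry_eligible: bool
    policy_reason: str
    changes: dict                       # changed route, context, parameters
    compensation: Optional[str]         # action taken and its result, if any
    approval_expires_at: Optional[str]  # None when no human decision applied
    final_outcome: str
    unresolved: str = ""                # remaining uncertainty, stated plainly
    timestamp_utc: str = field(default_factory=_utc_now)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)
```

Keeping `unresolved` as an explicit field encourages the record to state what is still unknown instead of implying certainty.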
Failure testing
Inject timeouts, duplicate deliveries, worker loss, invalid structured output, denied permissions, partial tool commits, expired approvals, checkpoint-store outages, and evidence-write failures. Tests should assert not only that the task returns an error, but that unauthorized work did not occur, completed effects were not duplicated, recovery stayed within budget, and evidence is sufficient to reconstruct the decision.
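
A partial tool commit is one of the easier cases to inject. The sketch below uses a hypothetical fake tool that commits its effect and then times out before replying. The recovery loop queries status instead of retrying blindly, and the test asserts on duplicated effects and evidence as well as the outcome. All names are invented for this example.

```python
class FlakyPaymentTool:
    """Hypothetical fake: commits the effect, then loses the response."""

    def __init__(self):
        self.committed = []

    def charge(self, operation_id):
        self.committed.append(operation_id)
        raise TimeoutError("response lost after commit")

    def status(self, operation_id):
        return operation_id in self.committed


def run_with_recovery(tool, operation_id, max_attempts=3):
    evidence = []
    for attempt in range(1, max_attempts + 1):
        try:
            tool.charge(operation_id)
            evidence.append((attempt, "committed"))
            return "completed", evidence
        except TimeoutError:
            evidence.append((attempt, "timeout"))
            # Determine prior state authoritatively before any retry.
            if tool.status(operation_id):
                evidence.append((attempt, "status_confirmed_commit"))
                return "completed", evidence
    return "failed", evidence


def test_partial_commit_is_not_duplicated():
    tool = FlakyPaymentTool()
    outcome, evidence = run_with_recovery(tool, "op-1")
    assert outcome == "completed"
    assert tool.committed == ["op-1"]          # effect happened exactly once
    assert evidence == [(1, "timeout"), (1, "status_confirmed_commit")]
```

The same pattern extends to the other injected failures: replace the fake's behavior and assert on effects, budget, and evidence rather than on the returned error alone.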
