Key takeaways
- Use checklists as evidence gates, not as a substitute for architecture judgment.
- Every item should have an owner, status, evidence link, and explicit exception process.
- High-risk items fail closed and require review rather than being averaged into a score.
- Re-run the relevant checklist when models, runtimes, tools, policies, hardware, or data boundaries change.
- Archive completed checklists with the release and decision record.
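The takeaways about owners, evidence, and failing closed can be sketched as a gate that returns blocking reasons instead of a score. This is a minimal illustration rather than a prescribed implementation: the `Item` fields, the `Status` values, and the rule that a waived high-risk item stays blocked until reviewed are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PASS = "pass"
    FAIL = "fail"
    WAIVED = "waived"
    NOT_APPLICABLE = "not_applicable"


@dataclass(frozen=True)
class Item:
    # Hypothetical record shape; field names are illustrative.
    item_id: str
    owner: str
    status: Status
    evidence_uri: str = ""
    high_risk: bool = False
    exception_approver: str = ""


def blocking_reasons(items):
    """Return every reason the release is blocked; an empty list means the gate passes."""
    reasons = []
    for item in items:
        if not item.owner:
            reasons.append(f"{item.item_id}: no owner")
        if item.status is Status.FAIL:
            reasons.append(f"{item.item_id}: failed")
        elif item.status is Status.PASS and not item.evidence_uri:
            reasons.append(f"{item.item_id}: pass without evidence")
        elif item.status is Status.WAIVED:
            if item.high_risk:
                # Assumption: high-risk waivers block until a reviewer resolves them.
                reasons.append(f"{item.item_id}: high-risk item cannot be waived")
            elif not item.exception_approver:
                reasons.append(f"{item.item_id}: waiver has no approver")
    return reasons
```

Because the function returns reasons rather than a percentage, any number of passing items cannot outvote a single failed one.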
Runtime boundary
A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.
Receives
Architecture, contracts, components, test evidence, risks, and release candidate.
Owns
Operational review structure and evidence expectations.
Emits
Pass/fail gates, owners, evidence, exceptions, remediation, and release decision.
Does not own
Automatic proof that a system is secure or correct.
Failure modes
Checkbox compliance without evidence, unclear ownership, stale reviews, waived mandatory controls, and missing failure tests.
Evidence and metrics
Counts of items passed, failed, and waived; evidence completeness; review age; remediation time; and post-release incidents.
Contract readiness
Confirm request/response schemas, identity/tenant, risk, tools, output, trace, deadlines/budgets, errors, compatibility, and fixtures.
Implementation
Require generated validation and contract tests across supported clients and versions.
Operational implications
Block unknown incompatible versions and ambiguous errors.
Measure
Validation pass rate, client matrix coverage, deprecated-version usage, and schema coverage.
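As a rough sketch of "block unknown incompatible versions and ambiguous errors", the validator below rejects anything it cannot positively classify and returns one specific code per outcome. The field names, the supported and deprecated major versions, and the error codes are all hypothetical; a real contract would typically be generated from the schema itself.

```python
from typing import Any

SUPPORTED_MAJORS = {1, 2}   # hypothetical supported contract majors
DEPRECATED_MAJORS = {1}     # still accepted, but reported so usage can be measured
REQUIRED_FIELDS = {"contract_version": str, "tenant_id": str, "deadline_ms": int}


def validate_request(req: dict[str, Any]) -> tuple[bool, str]:
    """Return (accepted, code). Every rejection carries exactly one specific code."""
    for name, kind in REQUIRED_FIELDS.items():
        if name not in req:
            return False, f"missing_field:{name}"
        if not isinstance(req[name], kind):
            return False, f"wrong_type:{name}"
    try:
        major = int(req["contract_version"].split(".")[0])
    except ValueError:
        return False, "unparseable_version"
    if major not in SUPPORTED_MAJORS:
        # Unknown versions are blocked rather than handled on a best-effort basis.
        return False, "unsupported_version"
    if major in DEPRECATED_MAJORS:
        return True, "accepted_deprecated"
    return True, "accepted"
```

Counting `accepted_deprecated` results gives the deprecated-use measure without a separate code path.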
Model and adapter readiness
Confirm exact model/tokenizer, format, conversion parity, precision, route capability, provider limits, streaming/cancellation, and fallback.
Implementation
Run compatibility and quality fixtures on the deployment artifact.
Operational implications
Do not approve from a provider capability list alone.
Measure
Load and parity test results, route errors, invalid-output rate, and fallback rate.
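A conversion parity fixture can be as small as comparing the numeric outputs of the reference artifact and the converted deployment artifact on the same input. The sketch below assumes both outputs are available as flat lists of floats and that an absolute tolerance is an acceptable criterion; the function name and the report keys are illustrative.

```python
def parity_report(reference, candidate, atol):
    """Compare two equal-length float sequences using a maximum absolute difference."""
    if len(reference) != len(candidate):
        raise ValueError("outputs differ in length; artifacts are not comparable")
    if not reference:
        raise ValueError("empty outputs prove nothing about parity")
    worst = max(abs(r - c) for r, c in zip(reference, candidate))
    return {"max_abs_diff": worst, "within_tolerance": worst <= atol}
```

An empty or mismatched comparison raises instead of passing, so a broken fixture cannot be recorded as evidence of parity.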
Context and RAG readiness
Confirm source ownership, identity filters, classification, freshness, chunking, citation, prompt-injection treatment, token budget, and semantic-layer policy.
Implementation
Test denied tenant/field access and stale or malicious sources.
Operational implications
Reject raw uncontrolled database access.
Measure
Source/citation, policy deny, retrieval latency, freshness, and injection cases.
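The denied-tenant and stale-source tests can be expressed against a small admission filter that runs before retrieved text reaches the prompt. This is a sketch under assumptions: the `Chunk` fields, the deny reason strings, and a single age limit per call are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Chunk:
    # Hypothetical retrieved-chunk shape.
    source_id: str
    tenant_id: str
    fetched_at: datetime
    text: str


def admit_chunks(chunks, tenant_id, now, max_age):
    """Split chunks into admitted ones and (source_id, reason) denials."""
    admitted, denied = [], []
    for chunk in chunks:
        if chunk.tenant_id != tenant_id:
            denied.append((chunk.source_id, "tenant_mismatch"))
        elif now - chunk.fetched_at > max_age:
            denied.append((chunk.source_id, "stale"))
        else:
            admitted.append(chunk)
    return admitted, denied


if __name__ == "__main__":
    now = datetime(2026, 1, 10, tzinfo=timezone.utc)
    sample = [Chunk("a", "t1", now - timedelta(days=1), "fresh and in scope")]
    print(admit_chunks(sample, "t1", now, timedelta(days=7)))
```

Returning the denials with reasons, rather than silently dropping chunks, is what makes the policy-deny and freshness measures countable.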
Tool readiness
Confirm versioned schemas, permission, credential scope, target validation, side-effect class, idempotency, timeout/retry, approval, output validation, sandbox, and audit.
Implementation
Run duplicate, timeout-after-dispatch, malformed output, and unauthorized target tests.
Operational implications
High-impact tools fail closed if policy or approval is unavailable.
Measure
Tool validation/auth/approval, duplicate prevention, result validity, and side-effect verification.
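Two of the behaviors above, duplicate prevention and failing closed when approval is unavailable, can be sketched with a thin gateway in front of tool execution. The `ToolGateway` class, its callback signatures, and the in-memory result cache are assumptions for illustration; a production system would need a durable idempotency store, and this sketch does not cover the timeout-after-dispatch case.

```python
class ToolGateway:
    """Hypothetical wrapper that enforces idempotency and approval before execution."""

    def __init__(self, approve, execute):
        self._approve = approve      # callable(tool, args) -> bool, may raise
        self._execute = execute      # callable(tool, args) -> output
        self._results = {}           # idempotency_key -> stored result

    def call(self, idempotency_key, tool, args, high_impact):
        if idempotency_key in self._results:
            # Duplicate request: return the stored result, do not re-run the side effect.
            return self._results[idempotency_key]
        if high_impact:
            try:
                approved = bool(self._approve(tool, args))
            except Exception:
                # Approval service unavailable: fail closed.
                approved = False
            if not approved:
                return {"status": "denied", "reason": "approval_missing_or_unavailable"}
        result = {"status": "ok", "output": self._execute(tool, args)}
        self._results[idempotency_key] = result
        return result
```

Denials are deliberately not cached, so a retry with the same key can succeed once approval is available again.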
Memory readiness
Confirm scopes, schemas, provenance, confidence, owner, write authority, conflict, expiry, retention, deletion, and review.
Implementation
Test cross-tenant isolation, poisoning, conflict, deletion, and stale-session behavior.
Operational implications
Systems of record remain authoritative.
Measure
Reads/writes, conflicts, deletion, poison detections, and scope violations.
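The cross-tenant isolation, expiry, and deletion tests need something concrete to run against. The store below is a deliberately small stand-in, assuming memory is keyed by tenant and key and that expiry is an epoch timestamp; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRecord:
    tenant_id: str
    key: str
    value: str
    provenance: str      # where the value came from, for later review
    expires_at: float    # epoch seconds


class ScopedMemory:
    """In-memory sketch: every read and delete must name the tenant scope."""

    def __init__(self):
        self._records = {}

    def write(self, record):
        self._records[(record.tenant_id, record.key)] = record

    def read(self, tenant_id, key, now):
        record = self._records.get((tenant_id, key))
        if record is None or now >= record.expires_at:
            return None  # missing, out of scope, or expired
        return record

    def delete(self, tenant_id, key):
        return self._records.pop((tenant_id, key), None) is not None
```

Since the tenant is part of the lookup key, a read from another tenant misses by construction rather than relying on a later filter.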
Security and governance readiness
Confirm threat model, identities, least privilege, egress, artifact integrity, secrets, isolation, tenant controls, output constraints, logging privacy, approvals, and incident plan.
Implementation
Run abuse and boundary tests mapped to OWASP/MITRE/NIST where relevant.
Operational implications
Mandatory controls require explicit owner and evidence.
Measure
Denied attacks, integrity checks, redaction, incidents, and exceptions.
Observability readiness
Confirm trace propagation, phase spans, versions, usage, cache/scheduler, tools/policy/memory events, metrics, logs, evaluations, sampling, redaction, retention, and dashboards.
Implementation
Generate a synthetic end-to-end trace and verify incident query/replay.
Operational implications
Do not launch to production while any privileged tool action is untraceable.
Measure
Trace completeness, export, cardinality, redaction, evaluation, and alert test.
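A trace completeness check over the synthetic end-to-end trace might look like the following. It assumes spans have been exported as plain dictionaries; the phase names, the `privileged` and `actor` keys, and the gap messages are invented for the example and are not OpenTelemetry attribute names.

```python
REQUIRED_PHASES = {"gateway", "retrieval", "model", "tool", "policy"}  # hypothetical phase names


def trace_gaps(spans):
    """Return human-readable gaps; an empty list means the trace is complete."""
    gaps = []
    trace_ids = {span.get("trace_id") for span in spans}
    if len(trace_ids) != 1 or None in trace_ids:
        gaps.append("spans do not share one trace_id")
    missing = REQUIRED_PHASES - {span.get("phase") for span in spans}
    for phase in sorted(missing):
        gaps.append(f"missing phase span: {phase}")
    for span in spans:
        if span.get("phase") == "tool" and span.get("privileged") and not span.get("actor"):
            gaps.append(f"privileged tool span {span.get('span_id')} has no actor")
    return gaps
```

Running this against a trace produced in the release candidate environment gives a pass/fail result that can be attached as evidence.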
Performance and benchmark readiness
Confirm production-shaped workload, environment manifest, warmup/cache state, quality, time to first token (TTFT), time per output token (TPOT), end-to-end latency (E2E), goodput, errors, memory, power/cost, repetitions, and raw evidence.
Implementation
Run at load beyond the SLO knee and under dependency failure.
Operational implications
Do not promote from average latency or vendor numbers.
Measure
Goodput against the SLO, errors, memory, cost, quality, and variance.
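To illustrate why average latency is a poor promotion signal, the sketch below computes goodput and a nearest-rank percentile from raw per-request results. The text does not define goodput, so the example assumes it means successful requests per second that meet both latency thresholds; the `RequestResult` fields and the threshold parameters are likewise assumptions.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RequestResult:
    ttft_ms: float   # time to first token
    tpot_ms: float   # time per output token
    ok: bool         # request completed without error


def goodput_rps(results, duration_s, ttft_slo_ms, tpot_slo_ms):
    """Requests per second that succeeded and met both latency thresholds."""
    good = sum(
        1 for r in results
        if r.ok and r.ttft_ms <= ttft_slo_ms and r.tpot_ms <= tpot_slo_ms
    )
    return good / duration_s


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    run = [RequestResult(100.0, 20.0, True)] * 9 + [RequestResult(2000.0, 20.0, True)]
    ttfts = [r.ttft_ms for r in run]
    print("mean TTFT:", mean(ttfts))
    print("p99 TTFT:", nearest_rank_percentile(ttfts, 99))
    print("goodput:", goodput_rps(run, duration_s=10, ttft_slo_ms=500, tpot_slo_ms=50))
```

In the sample run, one slow request in ten leaves the mean TTFT under the 500 ms threshold while the p99 and the goodput both expose the miss.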
Deployment and recovery readiness
Confirm immutable artifacts, compatibility tuple, readiness, canary, autoscaling, quotas, regional/data policy, rollback, backup/state recovery, and degraded mode.
Implementation
Rehearse bad model, node loss, provider outage, queue overload, and rollback.
Operational implications
Recovery evidence is part of the release.
Measure
Readiness, scale-up, recovery, and rollback results; failed rollouts; queue behavior; and availability.
Operations and lifecycle readiness
Confirm owners, SLOs, alerts, runbooks, on-call, cost budgets, dependency status, security contact, upgrade cadence, link/source review, correction process, and decision review triggers.
Implementation
Schedule post-release observation and archive evidence.
Operational implications
A system without an operating owner is not production-ready.
Measure
SLO incidents, toil, alert quality, cost variance, review age, and correction time.
Reference tables
| Field | Purpose |
|---|---|
| Item ID/version | Stable reference |
| Owner/reviewer | Accountability |
| Status | Pass/fail/waived/not applicable |
| Evidence URI | Test, trace, config, report |
| Risk/exception | Why it is not passing |
| Remediation/date | Next action |
| Release/decision link | Traceability |
| Reviewed UTC | Freshness |
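
One way to make the archived record verifiable is to serialize the fields in the table above into canonical JSON and store a digest with the release. The `archive_record` helper and the dictionary keys used with it are hypothetical; only `json` and `hashlib` from the Python standard library are used.

```python
import hashlib
import json


def archive_record(release_id, items):
    """Return (canonical_json, sha256_hex) for a completed checklist.

    `items` is a list of dicts, one per checklist item (hypothetical shape).
    Sorted keys and fixed separators make the output independent of key order.
    """
    payload = json.dumps(
        {"release": release_id, "items": items},
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return payload, digest
```

Storing the digest in the release and decision record lets a later reviewer confirm that the archived checklist is the version that was actually approved.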
Decision checklist
- Does every mandatory item have an owner and evidence?
- Which exceptions exist and who approved them?
- What failure tests were executed?
- Can the release be rolled back safely?
- Which dependency or source status must be rechecked?
- What change triggers re-running each checklist?
- Where is the completed evidence archived?
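The question about re-run triggers is easier to answer when the mapping from change type to checklist is written down. The sketch below uses the change categories named in the key takeaways and the checklist sections of this page; the specific mapping is an assumption and should be replaced by the team's own, and unknown change types select every checklist so that the lookup fails closed.

```python
ALL_CHECKLISTS = (
    "contract", "model_adapter", "context_rag", "tool", "memory",
    "security", "observability", "performance", "deployment", "operations",
)

# Hypothetical mapping; each team should derive its own from its risk review.
RERUN_ON_CHANGE = {
    "model": {"model_adapter", "performance", "security"},
    "runtime": {"model_adapter", "performance", "deployment"},
    "tool": {"tool", "security", "observability"},
    "policy": {"security", "tool", "context_rag", "memory"},
    "hardware": {"performance", "deployment"},
    "data_boundary": {"context_rag", "memory", "security"},
}


def checklists_to_rerun(changes):
    """Return checklist names to re-run, in the order of ALL_CHECKLISTS."""
    selected = set()
    for change in changes:
        if change not in RERUN_ON_CHANGE:
            # Unclassified change: re-run everything rather than guess.
            return list(ALL_CHECKLISTS)
        selected |= RERUN_ON_CHANGE[change]
    return [name for name in ALL_CHECKLISTS if name in selected]
```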
Common mistakes
- Treating all checklist items as equally waivable.
- Marking “implemented” without test or runtime evidence.
- Using one checklist for model, tool, and data changes even though they carry different risks.
- Approving performance without quality and error gates.
- Skipping recovery rehearsal.
- Leaving checklist ownership to an unnamed team.
- Failing to archive the reviewed version.
Sources and further reading
- NIST AI Risk Management Framework
- OWASP Top 10 for LLM Applications
- OpenTelemetry concepts
- MLPerf Inference
- JSON Schema specification
Last reviewed: 2026-06-21 UTC
