Integration Checklists - aRuntime.com

Key takeaways

Use checklists as evidence gates, not as a substitute for architecture judgment.
Every item should have an owner, status, evidence link, and explicit exception process.
High-risk items fail closed and require review rather than being averaged into a score.
Re-run the relevant checklist when models, runtimes, tools, policies, hardware, or data boundaries change.
Archive completed checklists with the release and decision record.
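The gate described above can be sketched in Python. This is a minimal illustration, not a specific tool: `ChecklistItem`, `Status`, and `release_blockers` are hypothetical names. Each item is judged on its own. A passing item must carry evidence. A waived high-risk item blocks the release instead of being averaged into a score.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PASS = "pass"
    FAIL = "fail"
    WAIVED = "waived"
    NOT_APPLICABLE = "not_applicable"


@dataclass
class ChecklistItem:
    item_id: str
    owner: Optional[str]
    status: Status
    evidence_uri: Optional[str]
    high_risk: bool = False
    exception_approver: Optional[str] = None


def release_blockers(items: list[ChecklistItem]) -> list[str]:
    """Return reasons the release is blocked; an empty list means the gate passes.

    Items are judged one by one. There is no aggregate score, so a single
    failing or waived high-risk item blocks regardless of how many others pass.
    """
    blockers = []
    for item in items:
        if not item.owner:
            blockers.append(f"{item.item_id}: no owner")
        if item.status is Status.PASS and not item.evidence_uri:
            blockers.append(f"{item.item_id}: marked pass without evidence")
        elif item.status is Status.FAIL:
            blockers.append(f"{item.item_id}: failed")
        elif item.status is Status.WAIVED:
            if item.high_risk:
                # High-risk items fail closed: a waiver needs separate review,
                # so the automated gate never lets one through.
                blockers.append(f"{item.item_id}: high-risk item cannot be waived here")
            elif not item.exception_approver:
                blockers.append(f"{item.item_id}: waiver has no approver")
    return blockers


# Example: one passing high-risk item with evidence, one approved low-risk waiver.
items = [
    ChecklistItem("tool-idempotency", "alice", Status.PASS,
                  "evidence/tool-idempotency.json", high_risk=True),
    ChecklistItem("perf-power", "bob", Status.WAIVED, None, exception_approver="carol"),
]
print(release_blockers(items))  # []
```

An empty result means that no blocker was found. It does not prove that the system is secure or correct.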

Runtime boundary

A useful architecture states what the checklist layer receives, owns, emits, measures, and refuses to own. Drawing that boundary keeps overlapping products from being treated as interchangeable.

Receives

Architecture, contracts, components, test evidence, risks, and release candidate.

Owns

Operational review structure and evidence expectations.

Emits

Pass/fail gates, owners, evidence, exceptions, remediation, and release decision.

Does not own

Automatic proof that a system is secure or correct.

Failure modes

Checkbox compliance without evidence, unclear ownership, stale reviews, waived mandatory controls, and missing failure tests.

Evidence and metrics

Items passed/failed/waived, evidence completeness, review age, remediation time, and post-release incidents.

Contract readiness

Confirm request/response schemas, identity/tenant, risk, tools, output, trace, deadlines/budgets, errors, compatibility, and fixtures.

Implementation

Require generated validation and contract tests across supported clients and versions.

Operational implications

Block unknown incompatible versions and ambiguous errors.

Measure

Validation pass rate, client-matrix coverage, deprecated-version use, and schema coverage.
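For contract readiness, a hand-written admission check might look like the following sketch. The `CONTRACT_VERSIONS` registry, the `REQUIRED_FIELDS` table, and `check_request` are hypothetical. In practice the required fields would more likely come from a JSON Schema document checked by a schema validator. The point being illustrated is that unknown versions and mistyped fields are rejected, not guessed at.

```python
from typing import Any

# Hypothetical registry of schema versions that have passing contract tests.
CONTRACT_VERSIONS = {"1.2": "supported", "1.1": "deprecated"}

# Hypothetical required request fields and their expected JSON-decoded types.
REQUIRED_FIELDS = {
    "schema_version": str,
    "tenant_id": str,
    "deadline_ms": int,
    "messages": list,
}


def check_request(request: dict[str, Any]) -> list[str]:
    """Return contract violations for one request; an empty list means admissible."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in request:
            problems.append(f"missing field: {field}")
        elif type(request[field]) is not expected:
            # Exact type match, so a JSON true is not accepted as an int.
            problems.append(f"wrong type for {field}: expected {expected.__name__}")
    version = request.get("schema_version")
    if isinstance(version, str) and version not in CONTRACT_VERSIONS:
        # Unknown versions are rejected rather than guessed at.
        problems.append(f"unknown schema version: {version}")
    return problems


request = {"schema_version": "1.2", "tenant_id": "t1", "deadline_ms": 500, "messages": []}
print(check_request(request))                               # []
print(check_request({**request, "schema_version": "9.0"}))  # ['unknown schema version: 9.0']
```

Deprecated versions are still admitted here, so they can be counted toward deprecated-version use rather than silently blocked.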

Model and adapter readiness

Confirm exact model/tokenizer, format, conversion parity, precision, route capability, provider limits, streaming/cancellation, and fallback.

Implementation

Run compatibility and quality fixtures on the deployment artifact.

Operational implications

Do not approve from a provider capability list alone.

Measure

Load and parity results, route errors, invalid-output rate, and fallback rate.
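A conversion-parity fixture can compare logits from the reference model against the deployment artifact on the same prompts. The sketch below is hypothetical. `parity_report`, the `atol` default, and the requirement of full top-1 agreement are illustrative assumptions. Real thresholds depend on precision and format.

```python
def parity_report(reference: list[list[float]], candidate: list[list[float]],
                  atol: float = 1e-2) -> dict:
    """Compare per-position logits from a reference model and a converted artifact.

    `reference` and `candidate` hold logits for the same fixture prompts,
    one row per token position. Thresholds are illustrative.
    """
    if not reference or len(reference) != len(candidate):
        raise ValueError("fixture outputs are empty or have different lengths")
    max_abs = 0.0
    top1_matches = 0
    for ref_row, cand_row in zip(reference, candidate):
        if len(ref_row) != len(cand_row):
            # A vocabulary-size mismatch usually means tokenizer or format drift.
            raise ValueError("vocabulary size mismatch")
        max_abs = max(max_abs, max(abs(r - c) for r, c in zip(ref_row, cand_row)))
        ref_top = max(range(len(ref_row)), key=ref_row.__getitem__)
        cand_top = max(range(len(cand_row)), key=cand_row.__getitem__)
        top1_matches += ref_top == cand_top
    agreement = top1_matches / len(reference)
    return {
        "max_abs_diff": max_abs,
        "top1_agreement": agreement,
        "passed": max_abs <= atol and agreement == 1.0,
    }


print(parity_report([[1.0, 2.0, 0.5]], [[1.001, 2.002, 0.5]])["passed"])  # True
```

The fixtures have to run on the artifact that will actually be deployed. A provider's capability list does not exercise the converted weights.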

Context and RAG readiness

Confirm source ownership, identity filters, classification, freshness, chunking, citation, prompt-injection treatment, token budget, and semantic-layer policy.

Implementation

Test denied tenant/field access and stale or malicious sources.

Operational implications

Reject raw uncontrolled database access.

Measure

Source and citation checks, policy denials, retrieval latency, freshness, and injection-test outcomes.
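Identity and freshness filters belong before ranking, so denied or stale content never reaches the prompt. This sketch assumes a hypothetical `Chunk` record, an illustrative three-level `CLEARANCE_ORDER`, and an `authorized_chunks` helper. Chunks with an unrecognized classification are dropped rather than admitted.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical classification levels, lowest to highest.
CLEARANCE_ORDER = ["public", "internal", "restricted"]


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    tenant_id: str
    classification: str
    updated_at: datetime
    text: str


def authorized_chunks(chunks: list[Chunk], tenant_id: str, clearance: str,
                      max_age: timedelta, now: Optional[datetime] = None) -> list[Chunk]:
    """Keep only chunks this caller may see, applied before ranking or prompting."""
    now = now or datetime.now(timezone.utc)
    # Raises ValueError for an unknown clearance, so a typo cannot widen access.
    allowed = set(CLEARANCE_ORDER[: CLEARANCE_ORDER.index(clearance) + 1])
    return [
        c for c in chunks
        if c.tenant_id == tenant_id
        and c.classification in allowed  # unrecognized labels are excluded
        and now - c.updated_at <= max_age
    ]


now = datetime(2026, 1, 10, tzinfo=timezone.utc)
docs = [
    Chunk("a", "t1", "internal", now - timedelta(days=1), "current policy"),
    Chunk("b", "t2", "public", now, "another tenant"),
    Chunk("c", "t1", "restricted", now, "above clearance"),
    Chunk("d", "t1", "public", now - timedelta(days=90), "stale"),
]
print([c.chunk_id for c in authorized_chunks(docs, "t1", "internal", timedelta(days=30), now)])
# ['a']
```

The same fixture list doubles as the denied-tenant and stale-source test cases.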

Tool readiness

Confirm versioned schemas, permission, credential scope, target validation, side-effect class, idempotency, timeout/retry, approval, output validation, sandbox, and audit.

Implementation

Run duplicate, timeout-after-dispatch, malformed output, and unauthorized target tests.

Operational implications

High-impact tools fail closed if policy or approval is unavailable.

Measure

Tool validation/auth/approval, duplicate prevention, result validity, and side-effect verification.
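The duplicate-call and fail-closed behaviors in tool readiness can be sketched with a small gateway. `ToolGateway`, `ApprovalUnavailable`, and the convention that the `approve` callback returns `None` when the approval service is down are all hypothetical.

```python
from typing import Callable, Optional


class ApprovalUnavailable(Exception):
    """Raised when the approval service cannot return a decision."""


class ToolGateway:
    """Hypothetical wrapper around side-effecting tool calls."""

    def __init__(self, approve: Callable[[str, dict], Optional[bool]]):
        # approve() returns True or False, or None when approval is unavailable.
        self._approve = approve
        self._results: dict[str, object] = {}  # idempotency key -> recorded result

    def call(self, key: str, tool: str, args: dict, high_impact: bool,
             execute: Callable[[dict], object]) -> object:
        if key in self._results:
            # A duplicate, such as a retry after a client timeout, gets the
            # recorded result instead of repeating the side effect.
            return self._results[key]
        if high_impact:
            decision = self._approve(tool, args)
            if decision is None:
                raise ApprovalUnavailable(f"{tool}: approval unavailable, failing closed")
            if not decision:
                raise PermissionError(f"{tool}: approval denied")
        result = execute(args)
        self._results[key] = result
        return result


sent = []
gateway = ToolGateway(approve=lambda tool, args: True)
for _ in range(2):  # the second call simulates a retry with the same key
    gateway.call("refund-42", "issue_refund", {"amount": 5}, True,
                 lambda a: sent.append(a) or "refund issued")
print(len(sent))  # 1
```

Results are recorded only after `execute` returns, and they live in process memory. This sketch therefore does not cover a crash mid-dispatch. A production gateway would likely need durable keys and idempotent downstream APIs, and those are the gaps the timeout-after-dispatch test should probe.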

Memory readiness

Confirm scopes, schemas, provenance, confidence, owner, write authority, conflict, expiry, retention, deletion, and review.

Implementation

Test cross-tenant isolation, poisoning, conflict, deletion, and stale-session behavior.

Operational implications

Systems of record remain authoritative.

Measure

Reads/writes, conflicts, deletion, poison detections, and scope violations.
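Scoped memory with tenant isolation, expiry, provenance, and deletion can be sketched as below. `MemoryRecord` and `ScopedMemory` are hypothetical, and the store is in-process only. Conflict resolution is not shown. The memory acts as a cache beside the system of record and never replaces it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class MemoryRecord:
    tenant_id: str
    user_id: str
    key: str
    value: str
    source: str          # provenance: the trace or document that produced the value
    expires_at: datetime


class ScopedMemory:
    """Hypothetical in-process store keyed by (tenant, user, key)."""

    def __init__(self) -> None:
        self._records: dict[tuple[str, str, str], MemoryRecord] = {}

    def write(self, record: MemoryRecord, writer_tenant: str) -> None:
        if writer_tenant != record.tenant_id:
            raise PermissionError("cross-tenant write rejected")
        self._records[(record.tenant_id, record.user_id, record.key)] = record

    def read(self, tenant_id: str, user_id: str, key: str,
             now: Optional[datetime] = None) -> Optional[MemoryRecord]:
        now = now or datetime.now(timezone.utc)
        record = self._records.get((tenant_id, user_id, key))
        if record is None or record.expires_at <= now:
            return None  # expired memory is treated as absent, not as stale truth
        return record

    def delete_user(self, tenant_id: str, user_id: str) -> int:
        doomed = [k for k in self._records if k[:2] == (tenant_id, user_id)]
        for k in doomed:
            del self._records[k]
        return len(doomed)


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
memory = ScopedMemory()
memory.write(MemoryRecord("t1", "u1", "units", "metric", "trace-123", now + timedelta(days=30)),
             writer_tenant="t1")
print(memory.read("t2", "u1", "units", now))  # None: another tenant sees nothing
```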

Security and governance readiness

Confirm threat model, identities, least privilege, egress, artifact integrity, secrets, isolation, tenant controls, output constraints, logging privacy, approvals, and incident plan.

Implementation

Run abuse and boundary tests mapped to OWASP/MITRE/NIST where relevant.

Operational implications

Mandatory controls require explicit owner and evidence.

Measure

Denied attacks, integrity checks, redaction, incidents, and exceptions.
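Logging privacy is easier to evidence when redaction is a tested function with a countable result. The patterns below are illustrative assumptions and not a complete secret or PII detector. `redact` returns the substitution count so that redaction can be reported as a metric.

```python
import re

# Hypothetical patterns; a real deployment would use its own secret and PII detectors.
REDACTION_PATTERNS = [
    (re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer [REDACTED]"),
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), "[EMAIL]"),
    (re.compile(r"(?i)(api[_-]?key\s*[=:]\s*)\S+"), r"\1[REDACTED]"),
]


def redact(line: str) -> tuple[str, int]:
    """Return the redacted log line and the number of substitutions made."""
    total = 0
    for pattern, replacement in REDACTION_PATTERNS:
        line, n = pattern.subn(replacement, line)
        total += n
    return line, total


line, count = redact("auth=Bearer abc.def-123 user=ana@example.com api_key=sk123")
print(line)   # auth=Bearer [REDACTED] user=[EMAIL] api_key=[REDACTED]
print(count)  # 3
```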

Observability readiness

Confirm trace propagation, phase spans, versions, usage, cache/scheduler, tools/policy/memory events, metrics, logs, evaluations, sampling, redaction, retention, and dashboards.

Implementation

Generate a synthetic end-to-end trace and verify incident query/replay.

Operational implications

No production launch with untraceable privileged tool actions.

Measure

Trace completeness, export, cardinality, redaction, evaluation, and alert tests.
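Checking a synthetic end-to-end trace can be automated against exported span records. This sketch assumes that spans are plain dictionaries with `trace_id`, `span_id`, `parent_id`, `name`, and optional `attributes` keys. The span names in `REQUIRED_PHASES` and the `policy.decision` attribute are a hypothetical local convention, not OpenTelemetry semantic conventions.

```python
# Hypothetical phase span names for one request.
REQUIRED_PHASES = {"request", "retrieval", "model.generate", "tool.call", "response"}


def trace_gaps(spans: list[dict]) -> list[str]:
    """Return completeness problems in one exported trace; empty means complete."""
    gaps = []
    trace_ids = {s["trace_id"] for s in spans}
    if len(trace_ids) != 1:
        gaps.append(f"expected one trace id, found {len(trace_ids)}")
    span_ids = {s["span_id"] for s in spans}
    for s in spans:
        parent = s.get("parent_id")
        if parent is not None and parent not in span_ids:
            gaps.append(f"{s['name']}: orphaned span (parent {parent} missing)")
        if s["name"].startswith("tool.") and "policy.decision" not in s.get("attributes", {}):
            # A privileged tool action with no recorded policy decision is untraceable.
            gaps.append(f"{s['name']}: tool span lacks policy decision")
    missing = REQUIRED_PHASES - {s["name"] for s in spans}
    gaps.extend(f"missing phase: {name}" for name in sorted(missing))
    return gaps


spans = [
    {"trace_id": "t", "span_id": "1", "parent_id": None, "name": "request"},
    {"trace_id": "t", "span_id": "2", "parent_id": "1", "name": "retrieval"},
    {"trace_id": "t", "span_id": "3", "parent_id": "1", "name": "model.generate"},
    {"trace_id": "t", "span_id": "4", "parent_id": "3", "name": "tool.call",
     "attributes": {"policy.decision": "allow"}},
    {"trace_id": "t", "span_id": "5", "parent_id": "1", "name": "response"},
]
print(trace_gaps(spans))  # []
```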

Performance and benchmark readiness

Confirm production-shaped workload, environment manifest, warmup/cache state, quality, TTFT/TPOT/E2E, Goodput, errors, memory, power/cost, repetitions, and raw evidence.

Implementation

Run at load beyond the SLO knee and under dependency failure.

Operational implications

Do not promote from average latency or vendor numbers.

Measure

Goodput/SLO, errors, memory, cost, quality, and variance.
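A short calculation shows the gap between average latency and goodput. The sketch below uses one common definition of goodput: requests per second that succeed and meet every latency SLO. `RequestResult`, the SLO values, and the nearest-rank `percentile` helper are assumptions for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestResult:
    ttft_ms: float   # time to first token
    tpot_ms: float   # mean time per output token
    ok: bool         # no error and passed the quality check


def goodput(results: list[RequestResult], duration_s: float,
            ttft_slo_ms: float, tpot_slo_ms: float) -> float:
    """Requests per second that succeeded and met both latency SLOs."""
    good = sum(1 for r in results
               if r.ok and r.ttft_ms <= ttft_slo_ms and r.tpot_ms <= tpot_slo_ms)
    return good / duration_s


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


results = [RequestResult(100, 20, True)] * 9 + [RequestResult(2000, 20, True)]
ttfts = [r.ttft_ms for r in results]
print(sum(ttfts) / len(ttfts))          # 290.0
print(percentile(ttfts, 99))            # 2000
print(goodput(results, 10.0, 500, 50))  # 0.9
```

With nine requests at 100 ms TTFT and one at 2,000 ms, the mean is 290 ms, well under a 500 ms SLO. The p99, however, is 2,000 ms, and only nine of the ten requests count toward goodput.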

Deployment and recovery readiness

Confirm immutable artifacts, compatibility tuple, readiness, canary, autoscaling, quotas, regional/data policy, rollback, backup/state recovery, and degraded mode.

Implementation

Rehearse bad model, node loss, provider outage, queue overload, and rollback.

Operational implications

Recovery evidence is part of release.

Measure

Readiness, scaling, recovery, and rollback results; failed rollouts; queue depth; and availability.
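A canary gate that can return `rollback` is only useful if the rollback path has been rehearsed. This sketch shows one possible decision rule. `CanaryStats`, `canary_decision`, and every threshold are hypothetical and would be set from the service's SLOs. Quality signals are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class CanaryStats:
    requests: int
    errors: int
    p99_latency_ms: float


def canary_decision(baseline: CanaryStats, canary: CanaryStats,
                    max_error_delta: float = 0.005,
                    max_latency_ratio: float = 1.2,
                    min_requests: int = 500) -> str:
    """Return "hold", "rollback", or "promote" for a canary slice."""
    if baseline.requests < min_requests or canary.requests < min_requests:
        return "hold"  # too little traffic to judge either way
    base_error_rate = baseline.errors / baseline.requests
    canary_error_rate = canary.errors / canary.requests
    if canary_error_rate - base_error_rate > max_error_delta:
        return "rollback"
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"


baseline = CanaryStats(requests=10_000, errors=20, p99_latency_ms=800.0)
print(canary_decision(baseline, CanaryStats(1_000, 30, 820.0)))  # rollback
```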

Operations and lifecycle readiness

Confirm owners, SLOs, alerts, runbooks, on-call, cost budgets, dependency status, security contact, upgrade cadence, link/source review, correction process, and decision review triggers.

Implementation

Schedule post-release observation and archive evidence.

Operational implications

A system without an operating owner is not production-ready.

Measure

SLO incidents, toil, alert quality, cost variance, review age, and correction time.

Reference tables

Checklist evidence record
Field	Purpose
Item ID/version	Stable reference
Owner/reviewer	Accountability
Status	Pass/fail/waived/not applicable
Evidence URI	Test, trace, config, report
Risk/exception	Why it is not passing
Remediation/date	Next action
Release/decision link	Traceability
Reviewed UTC	Freshness
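The evidence record above maps naturally onto a typed structure with a few validity rules. The `EvidenceRecord` class, its status strings, the 90-day freshness window, and identifiers such as `REL-42` are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta, timezone
from typing import Optional

VALID_STATUSES = {"pass", "fail", "waived", "not_applicable"}


@dataclass
class EvidenceRecord:
    item_id: str                    # stable ID including version, e.g. "ctx-tenant@v1"
    owner: str
    reviewer: str
    status: str
    evidence_uri: Optional[str]     # test, trace, config, or report
    risk_exception: Optional[str]   # why the item is not passing
    remediation: Optional[str]
    remediation_date: Optional[date]
    release_link: str
    reviewed_utc: datetime

    def problems(self, now: datetime, max_age: timedelta = timedelta(days=90)) -> list[str]:
        issues = []
        if self.status not in VALID_STATUSES:
            issues.append(f"unknown status: {self.status}")
        if self.status in {"fail", "waived"} and not (
                self.risk_exception and self.remediation and self.remediation_date):
            issues.append("non-passing item needs exception, remediation, and date")
        if now - self.reviewed_utc > max_age:
            issues.append("review is stale")
        return issues


now = datetime(2026, 6, 21, tzinfo=timezone.utc)
record = EvidenceRecord("ctx-tenant@v1", "alice", "bob", "fail", None, None, None, None,
                        "REL-42", now - timedelta(days=200))
print(record.problems(now))
# ['non-passing item needs exception, remediation, and date', 'review is stale']
```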

Decision checklist

Does every mandatory item have an owner and evidence?
Which exceptions exist and who approved them?
What failure tests were executed?
Can the release be rolled back safely?
Which dependency or source status must be rechecked?
What change triggers re-running each checklist?
Where is the completed evidence archived?
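Re-run triggers can be made explicit by diffing the release manifest. The field names, the `RERUN_TRIGGERS` mapping, and the checklist identifiers below are hypothetical. A changed field that has no mapping triggers every checklist, so an unmapped change fails closed.

```python
# Hypothetical checklist identifiers, one per readiness section.
ALL_CHECKLISTS = frozenset({
    "contract", "model_adapter", "context_rag", "tool", "memory", "security",
    "observability", "performance", "deployment", "operations",
})

# Hypothetical mapping from release-manifest fields to affected checklists.
RERUN_TRIGGERS = {
    "model": {"model_adapter", "performance", "security"},
    "runtime": {"model_adapter", "performance", "deployment"},
    "tools": {"tool", "security", "observability"},
    "policy": {"security", "context_rag", "tool"},
    "hardware": {"performance", "deployment"},
    "data_boundary": {"context_rag", "memory", "security"},
}


def checklists_to_rerun(previous: dict[str, str], current: dict[str, str]) -> set[str]:
    """Return the checklists invalidated by differences between two manifests."""
    rerun: set[str] = set()
    for field in previous.keys() | current.keys():
        if previous.get(field) != current.get(field):
            # An unmapped field triggers every checklist (fail closed).
            rerun |= RERUN_TRIGGERS.get(field, ALL_CHECKLISTS)
    return rerun


previous = {"model": "m-1.0", "runtime": "r-2.3", "hardware": "gpu-a"}
current = {**previous, "model": "m-1.1"}
print(sorted(checklists_to_rerun(previous, current)))
# ['model_adapter', 'performance', 'security']
```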

Common mistakes

Treating all checklist items as equally waivable.
Marking “implemented” without test or runtime evidence.
Using one checklist for model, tool, and data changes with different risks.
Approving performance without quality and error gates.
Skipping recovery rehearsal.
Leaving checklist ownership to an unnamed team.
Failing to archive the reviewed version.

Sources and further reading

NIST AI Risk Management Framework · NIST · Government framework · accessed 2026-06-21 UTC
OWASP Top 10 for LLM Applications · OWASP GenAI Security Project · Official documentation · accessed 2026-06-21 UTC
OpenTelemetry concepts · OpenTelemetry · Official documentation · accessed 2026-06-21 UTC
MLPerf Inference · MLCommons · Benchmark specification · accessed 2026-06-21 UTC
JSON Schema specification · JSON Schema · Specification · accessed 2026-06-21 UTC

Last reviewed: 2026-06-21 UTC
