AI runtime evaluation measures whether a composed system completes the intended task safely, reliably, efficiently, and with sufficient evidence. Model quality is one input, not the complete outcome.
Key takeaways
- Evaluate components, workflows, and domain outcomes separately.
- Use repeated trials and failure classification for stochastic behavior.
- Gate releases on safety, reliability, cost, and evidence, not on average answer score alone.
Evaluation levels
- Model: task quality, calibration, format adherence, refusal, and robustness.
- Runtime component: routing, cache, tool, policy, memory, recovery, and trace correctness.
- Workflow: end-to-end success, latency, cost, approvals, side effects, and safe failure.
- Domain outcome: business or user result validated by an accountable source of truth.
Cases and datasets
Use representative tasks with input classification, expected constraints, allowed actions, evaluation method, and known failure modes. Include normal, ambiguous, adversarial, unavailable-dependency, and partial-side-effect cases. Preserve versions and provenance.
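The case fields listed above can be captured in a small record type. The following is a minimal sketch, not a prescribed schema: the names `EvalCase`, `CaseKind`, `missing_kinds`, and `disallowed_actions`, and the example label values in the comments, are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Tuple


class CaseKind(Enum):
    """Case categories named in the text (identifiers are illustrative)."""
    NORMAL = "normal"
    AMBIGUOUS = "ambiguous"
    ADVERSARIAL = "adversarial"
    UNAVAILABLE_DEPENDENCY = "unavailable_dependency"
    PARTIAL_SIDE_EFFECT = "partial_side_effect"


@dataclass(frozen=True)
class EvalCase:
    """One evaluation case; frozen so a stored case cannot be mutated."""
    case_id: str
    kind: CaseKind
    task: str
    input_classification: str            # hypothetical labels, e.g. "public"
    expected_constraints: Tuple[str, ...]
    allowed_actions: FrozenSet[str]
    evaluation_method: str               # hypothetical labels, e.g. "rubric"
    known_failure_modes: Tuple[str, ...] = ()
    dataset_version: str = "0.0.0"       # preserve versions
    provenance: str = "unspecified"      # preserve provenance


def missing_kinds(cases):
    """Return the case kinds the dataset does not cover, sorted by value."""
    covered = {case.kind for case in cases}
    return sorted((k for k in CaseKind if k not in covered),
                  key=lambda k: k.value)


def disallowed_actions(case, actions_taken):
    """Return the actions from a run that the case does not allow, in order."""
    return [a for a in actions_taken if a not in case.allowed_actions]
```

Calling `missing_kinds` on a dataset gives a quick coverage check before a suite is accepted, and `disallowed_actions` can be applied to the action trace of each run.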
Metric families
- Task completion and first-attempt success
- Unsupported claim, invalid output, and correction rate
- Tool success, permission denial, duplicate effect, and compensation
- Recovery success and time
- Time to first token (TTFT), time per output token (TPOT), task duration, and approval delay
- Tokens, compute, tool cost, human review, and energy where measurable
- Policy violation and safe-denial rate
- Evidence completeness and redaction compliance
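Several of the metric families above reduce to simple rates over per-run records. The sketch below assumes a hypothetical `RunRecord` shape; the field names and the choice to count "runs with at least one occurrence" are illustrative assumptions, not definitions from this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """Per-run facts (hypothetical shape) needed for the rates below."""
    completed: bool
    attempts: int
    unsupported_claims: int
    duplicate_effects: int
    policy_violations: int
    evidence_fields_present: int
    evidence_fields_required: int


def rate(numerator, denominator):
    """Return numerator / denominator, or None when there is no data."""
    return numerator / denominator if denominator else None


def summarize(runs):
    """Aggregate run records into a few of the metric families."""
    n = len(runs)
    return {
        "task_completion": rate(sum(r.completed for r in runs), n),
        "first_attempt_success": rate(
            sum(r.completed and r.attempts == 1 for r in runs), n),
        "runs_with_unsupported_claim": rate(
            sum(r.unsupported_claims > 0 for r in runs), n),
        "runs_with_duplicate_effect": rate(
            sum(r.duplicate_effects > 0 for r in runs), n),
        "runs_with_policy_violation": rate(
            sum(r.policy_violations > 0 for r in runs), n),
        "evidence_completeness": rate(
            sum(r.evidence_fields_present for r in runs),
            sum(r.evidence_fields_required for r in runs)),
    }
```

Returning `None` for an empty input keeps "no data" distinct from a measured rate of zero, which matters when a metric feeds a release decision.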
Stochastic evaluation
Repeat runs across seeds or sampling conditions and report distributions, not one favorable trace. Separate nondeterministic model variation from infrastructure variance. Store the exact model deployment, prompt/instruction version, route, tools, policy, and runtime configuration.
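One way to report a distribution rather than a single trace is to run the same case across seeds, classify each outcome, and attach an interval to the success rate. The sketch below uses a Wilson score interval, which is a choice made for this example and not something the article prescribes. The names `run_trials`, `config_fingerprint`, and the `"ok"` outcome label are assumptions; `run_once` stands in for whatever invokes the system under test and returns an outcome label.

```python
import hashlib
import json
import math
from collections import Counter


def config_fingerprint(config):
    """Stable hash of the configuration a set of trials ran under."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def run_trials(run_once, seeds, config):
    """Run one case per seed and summarize the outcome distribution.

    run_once(seed) must return an outcome label; "ok" means success and
    any other label is treated as a failure class.
    """
    outcomes = Counter(run_once(seed) for seed in seeds)
    trials = sum(outcomes.values())
    successes = outcomes.get("ok", 0)
    interval = wilson_interval(successes, trials)  # raises if no trials ran
    return {
        "config_fingerprint": config_fingerprint(config),
        "trials": trials,
        "success_rate": successes / trials,
        "interval": interval,
        "outcomes": dict(outcomes),
    }
```

Keeping the failure labels in `outcomes` lets a report distinguish, for example, model-side failures from infrastructure-side ones, provided `run_once` labels them separately. The fingerprint ties the result to the exact configuration that produced it.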
Online evaluation
Use canary releases, shadow traffic, and sampled review, each with privacy controls. Cap the number of model calls an online evaluator may make so that it cannot become an unbounded model-calling loop. Detect drift in task mix, quality, tool failures, latency, cost, and policy outcomes, and connect alerts to rollback or investigation.
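Two of the ideas above, a bounded evaluator and task-mix drift detection, can be sketched briefly. This is an illustration under stated assumptions: `SampledEvaluator`, its `max_calls` budget, and the `is_evaluator_traffic` flag are hypothetical names, and total variation distance is just one reasonable way to compare a baseline task mix with a current one.

```python
import random


class SampledEvaluator:
    """Decides which requests get an online evaluation, within a budget."""

    def __init__(self, sample_rate, max_calls, rng=None):
        if not 0.0 <= sample_rate <= 1.0:
            raise ValueError("sample_rate must be between 0 and 1")
        self.sample_rate = sample_rate
        self.max_calls = max_calls      # hard cap on evaluator calls
        self.calls_made = 0
        self._rng = rng or random.Random()

    def should_evaluate(self, is_evaluator_traffic=False):
        # Never evaluate the evaluator's own calls; this breaks the loop.
        if is_evaluator_traffic:
            return False
        # Stop once the budget is spent.
        if self.calls_made >= self.max_calls:
            return False
        # Sample the remaining traffic at the configured rate.
        if self._rng.random() >= self.sample_rate:
            return False
        self.calls_made += 1
        return True


def total_variation(baseline, current):
    """Total variation distance between two count mappings (0 to 1)."""
    b_total = sum(baseline.values())
    c_total = sum(current.values())
    if not b_total or not c_total:
        raise ValueError("both mappings need at least one observation")
    keys = set(baseline) | set(current)
    return 0.5 * sum(
        abs(baseline.get(k, 0) / b_total - current.get(k, 0) / c_total)
        for k in keys
    )
```

An alert could fire when `total_variation` between last week's task counts and today's exceeds a threshold chosen by the owning team. The budget would normally be reset per time window, which this sketch leaves out.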
Safety and governance
Test indirect prompt injection, data-boundary violations, privilege escalation, unsafe tool plans, approval bypass, poisoned memory, and evidence tampering. Human reviewers need clear rubrics and conflict-of-interest disclosure. Model-based evaluators provide useful signals but should not be the sole authority for high-impact outcomes.
Release gates
For each gate, define minimum quality and evidence thresholds, maximum violation, error, and cost limits, rollback criteria, and an owner. Gate changes to model, engine, route, prompt, tool, policy, retrieval, and application separately where possible. Record the evaluation suite and its result with the release.
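A gate of this kind can be expressed as a list of minimum and maximum thresholds checked against recorded results. The following is a minimal sketch: `Threshold`, `check_gate`, and the metric names in the usage note are hypothetical, and treating a missing result as a failure is an assumption made here so that absent evidence cannot pass a gate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """One gate condition: metric must be >= limit ("min") or <= limit ("max")."""
    metric: str
    limit: float
    direction: str  # "min" or "max"


def check_gate(results, thresholds):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for t in thresholds:
        if t.direction not in ("min", "max"):
            raise ValueError(f"unknown direction for {t.metric}: {t.direction}")
        value = results.get(t.metric)
        if value is None:
            # Assumption: no recorded evidence counts as a failed condition.
            failures.append(f"{t.metric}: missing result")
        elif t.direction == "min" and value < t.limit:
            failures.append(f"{t.metric}: {value} is below minimum {t.limit}")
        elif t.direction == "max" and value > t.limit:
            failures.append(f"{t.metric}: {value} is above maximum {t.limit}")
    return failures
```

For example, a gate with `Threshold("task_completion", 0.9, "min")` and `Threshold("policy_violation_rate", 0.0, "max")` blocks a release whose results show any policy violation, however high the completion rate. Storing the returned messages alongside the release gives a record of why a gate passed or failed.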
Limits
Offline test sets cannot cover every production context. Metrics can be gamed or misaligned with user outcomes. Treat evaluation as a continuous risk-control process, document uncertainty, and maintain a correction path.
