Runtime observability explains how infrastructure, model, serving, tool, policy, business, and evaluation behavior combine during a request. It must preserve correlation without turning sensitive model and tool payloads into unrestricted logs.
Key takeaways
- Separate operational telemetry from durable review evidence.
- Correlate layers with stable identifiers and explicit span kinds.
- Measure queueing, cache behavior, tool calls, approvals, and recovery, not only model latency.
Signal model
Infrastructure
CPU/GPU/NPU, memory, storage, network, process, worker, queue, and dependency health.
Compiler and model
Artifact/version, compile/warmup, route, prefill, decode, cache, tokens, and stop reason.
Serving
Admission, queueing, batching, routing, replica, rollout, overload, and deadline.
Tool and policy
Tool version, operation key, authorization, approval, dependency, side effect, and compensation.
Business outcome
Domain result, changed resource, customer-visible status, and accountable decision.
Evaluation
Quality, safety, task success, evidence completeness, cost, and failure classification.
Correlation
Use requestId and correlationId across application boundaries, traceId/spanId for execution structure, operation IDs for tools, policy decision and approval IDs for authorization, and artifact/evidence IDs for durable review. Asynchronous messages should use trace links where a single parent span would be misleading, for example when one consumer span is caused by several producers.
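The identifier scheme above can be sketched as one immutable context object that travels with a request. This is a minimal illustration, not a prescribed schema: the class name, field names, and ID lengths are assumptions chosen for the example.

```python
import uuid
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class CorrelationContext:
    """Identifiers carried with a request (all field names are illustrative)."""
    request_id: str
    correlation_id: str
    trace_id: str
    span_id: str
    operation_id: Optional[str] = None        # set for tool operations
    policy_decision_id: Optional[str] = None  # set once authorization is decided
    approval_id: Optional[str] = None         # set when a human approval exists
    evidence_ids: tuple = ()                  # durable review artifacts

    def child_span(self) -> "CorrelationContext":
        """Start a new span in the same trace; every other ID is preserved."""
        return replace(self, span_id=uuid.uuid4().hex[:16])

    def for_tool(self, operation_id: str) -> "CorrelationContext":
        """Child span that also carries the tool's operation ID."""
        return replace(self.child_span(), operation_id=operation_id)


def new_context(correlation_id: str) -> CorrelationContext:
    """Create the root context at the application boundary."""
    return CorrelationContext(
        request_id=uuid.uuid4().hex,
        correlation_id=correlation_id,
        trace_id=uuid.uuid4().hex,
        span_id=uuid.uuid4().hex[:16],
    )


root = new_context("order-123")       # "order-123" is a made-up business key
tool_ctx = root.for_tool("op-1")      # same trace and request, new span
```

Keeping the context frozen means a layer can only derive a new context, never silently overwrite an identifier that an upstream layer set.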
Metrics
| Layer | Metrics |
|---|---|
| Compiler/engine | load, warmup, compile, prefill, decode, cache occupancy, allocation failure |
| Serving | queue, admission rejection, batch, route, TTFT, TPOT, goodput, rollout health |
| Distributed | collective, transfer, remote-cache hit, placement, worker failure |
| Agentic | step count, tool latency, approval wait, recovery, budget, evidence completion |
| Product | successful outcome, correction, abandonment, safe denial, incident |
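As a worked example of the serving metrics in the table, the sketch below derives TTFT (time to first token) and TPOT (time per output token) from token timestamps, and counts goodput as requests that meet both latency objectives per second. Definitions of goodput vary between systems; the function names and the threshold parameters here are assumptions for illustration.

```python
from typing import Optional, Sequence


def ttft(arrival_s: float, token_times_s: Sequence[float]) -> Optional[float]:
    """Time to first token in seconds, measured from request arrival so that
    queueing and prefill are included. None if no token was produced."""
    return token_times_s[0] - arrival_s if token_times_s else None


def tpot(token_times_s: Sequence[float]) -> Optional[float]:
    """Mean time per output token after the first. None with fewer than 2 tokens."""
    if len(token_times_s) < 2:
        return None
    return (token_times_s[-1] - token_times_s[0]) / (len(token_times_s) - 1)


def meets_objective(arrival_s: float, token_times_s: Sequence[float],
                    max_ttft_s: float, max_tpot_s: float) -> bool:
    """True when the request produced output within both latency objectives."""
    first = ttft(arrival_s, token_times_s)
    if first is None or first > max_ttft_s:
        return False
    per_token = tpot(token_times_s)
    return per_token is None or per_token <= max_tpot_s


def goodput(requests, window_s: float, max_ttft_s: float, max_tpot_s: float) -> float:
    """Requests per second that met the objectives within the window.
    `requests` is an iterable of (arrival_s, token_times_s) pairs."""
    good = sum(1 for arrival, tokens in requests
               if meets_objective(arrival, tokens, max_ttft_s, max_tpot_s))
    return good / window_s
```

Measuring TTFT from arrival rather than from the start of model execution is a deliberate choice in this sketch: it keeps queueing delay visible in the metric instead of reporting model latency alone.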
Logs and events
Use structured events with stable names, versions, UTC timestamps, severity, source layer, correlation, and sanitized attributes. Avoid free-form logging of prompts and tool payloads. Log changes to policy, route, model, tool, and configuration separately from request events.
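One way to apply these rules is to build every event through a single function that enforces an attribute allowlist. The sketch below is an assumption-laden example: the allowlist contents, the event field names, and the `make_event` helper are hypothetical, not part of any particular logging library.

```python
import json
from datetime import datetime, timezone

# Hypothetical allowlist: only these attribute keys may appear in an event.
ALLOWED_ATTRIBUTES = {
    "model_version", "route", "input_tokens", "output_tokens", "stop_reason",
}


def make_event(name, version, severity, layer, correlation, attributes):
    """Build a structured event, dropping any attribute not on the allowlist."""
    kept = {k: v for k, v in attributes.items() if k in ALLOWED_ATTRIBUTES}
    dropped = sorted(k for k in attributes if k not in ALLOWED_ATTRIBUTES)
    return {
        "name": name,                    # stable event name
        "version": version,              # event schema version
        "timestamp": datetime.now(timezone.utc).isoformat(),  # always UTC
        "severity": severity,
        "layer": layer,                  # source layer, e.g. "serving"
        "correlation": dict(correlation),
        "attributes": kept,
        "dropped_attributes": dropped,   # key names only, never values
    }


event = make_event(
    name="model.invocation.completed",
    version=1,
    severity="INFO",
    layer="model",
    correlation={"requestId": "r-1", "traceId": "t-1"},
    attributes={"output_tokens": 64, "stop_reason": "end_turn", "prompt": "secret text"},
)
line = json.dumps(event)  # the prompt value is absent from the serialized line
```

Recording the names of dropped keys, but not their values, makes accidental payload logging visible in review without leaking the payload itself.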
Traces
Trace model invocation, tool calls, policy decisions, waits, and recovery as distinct spans. Model spans should include deployment/version, input/output token counts, phase durations, cache indicators, and stop reason when available. OpenTelemetry’s generative-AI conventions are evolving; pin the convention version used by an implementation. [ar_cite id="otel-genai" label="OpenTelemetry"]
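The span structure described above can be illustrated with a small standard-library model. This is a sketch only: it does not use the OpenTelemetry SDK, and the span kinds, attribute names, and the `REQUEST` root kind are assumptions for the example rather than names from any semantic convention.

```python
import time
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class SpanKind(Enum):
    REQUEST = "request"    # assumed root kind for the whole request
    MODEL = "model"
    TOOL = "tool"
    POLICY = "policy"
    WAIT = "wait"
    RECOVERY = "recovery"


@dataclass
class Span:
    trace_id: str
    kind: SpanKind
    name: str
    parent_span_id: Optional[str] = None
    links: list = field(default_factory=list)       # (trace_id, span_id) pairs
    attributes: dict = field(default_factory=dict)
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start_ns: int = field(default_factory=time.monotonic_ns)
    end_ns: Optional[int] = None

    def child(self, kind: SpanKind, name: str, **attributes) -> "Span":
        """Create a child span in the same trace."""
        return Span(self.trace_id, kind, name,
                    parent_span_id=self.span_id, attributes=attributes)

    def end(self) -> None:
        self.end_ns = time.monotonic_ns()


root = Span(trace_id=uuid.uuid4().hex, kind=SpanKind.REQUEST, name="handle_request")

# Model span with illustrative attribute names and made-up values.
model = root.child(SpanKind.MODEL, "generate",
                   deployment="example-model@v1", input_tokens=812,
                   output_tokens=64, cache_hit=True, stop_reason="end_turn")
model.end()

# Work triggered asynchronously runs in its own trace and links back to its
# cause instead of claiming the model span as its parent.
followup = Span(trace_id=uuid.uuid4().hex, kind=SpanKind.TOOL, name="send_receipt",
                links=[(model.trace_id, model.span_id)])
```

Giving each span an explicit kind lets dashboards separate approval waits and recovery time from model time without parsing span names.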
Evidence boundary
Observability may be sampled, short-lived, or operator-focused. Evidence is selected for durable review and may include artifact hashes, policy reasons, approvals, side-effect records, and failure history. Evidence should reference traces without requiring every trace payload to be retained.
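A minimal sketch of an evidence record follows, assuming SHA-256 artifact hashes and a flat record layout. The `EvidenceRecord` fields and the example values are hypothetical; the point is that the record stores a trace reference and hashes, not the trace payload.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


def artifact_hash(payload: bytes) -> str:
    """Content hash that identifies an artifact without storing its content."""
    return "sha256:" + hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class EvidenceRecord:
    evidence_id: str
    trace_id: str                 # reference only; the trace itself may expire
    artifact_hashes: tuple
    policy_reason: str
    approval_id: Optional[str]
    side_effects: tuple           # descriptions of changed resources
    failure_history: tuple        # earlier failed attempts, if any


tool_output = b'{"refund": "issued"}'   # made-up payload, kept in restricted storage
record = EvidenceRecord(
    evidence_id="ev-1",
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    artifact_hashes=(artifact_hash(tool_output),),
    policy_reason="refund below auto-approval limit",
    approval_id=None,
    side_effects=("refund created",),
    failure_history=(),
)
serialized = json.dumps(asdict(record), sort_keys=True)
```

Because the record holds only a hash, a reviewer can later verify a retained artifact against it, while routine trace payloads can be sampled or deleted on their own schedule.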
Privacy and sampling
- Keep raw prompt and completion capture off by default.
- Separate restricted payload storage from broad metrics.
- Apply tenant-aware access and deletion.
- Retain denials, high-risk actions, errors, and evidence-required events even when routine traces are sampled.
- Do not send search text, prompts, or contact content to analytics by default.
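The retention rule in the list above can be sketched as a sampling decision that always keeps flagged traces and samples the rest deterministically by trace ID. The flag names and the hash-based bucketing are assumptions for this example, not a required mechanism.

```python
import hashlib

# Hypothetical flag names for events that must survive sampling.
ALWAYS_RETAIN = {"denial", "high_risk_action", "error", "evidence_required"}


def should_retain(trace_id: str, flags: set, sample_rate: float) -> bool:
    """Keep every flagged trace; keep routine traces at roughly `sample_rate`.

    The decision is a pure function of the trace ID, so every service that
    sees the same trace makes the same choice without coordination.
    """
    if flags & ALWAYS_RETAIN:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")      # uniform in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


# A denied request is retained even when routine sampling is switched off.
keep_denial = should_retain("trace-a", {"denial"}, sample_rate=0.0)
keep_routine = should_retain("trace-a", set(), sample_rate=0.0)
```

Hashing the trace ID rather than drawing a random number keeps the spans of one trace together: either all services retain it or none do.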
Operational views
Provide views for service objectives, capacity, model quality, tool reliability, approval backlog, security events, recovery, cost, and evidence gaps. Each dashboard should link from aggregate metrics to minimized request evidence, subject to authorization.
