Runtime Observability

Runtime observability explains how infrastructure, model, serving, tool, policy, business, and evaluation behavior combine during a request. It must preserve correlation without turning sensitive model and tool payloads into unrestricted logs.

Key takeaways

Separate operational telemetry from durable review evidence.
Correlate layers with stable identifiers and explicit span kinds.
Measure queueing, cache, tools, approvals, and recovery, not only model latency.

Signal model

Infrastructure

CPU/GPU/NPU, memory, storage, network, process, worker, queue, and dependency health.

Compiler and model

Artifact/version, compile/warmup, route, prefill, decode, cache, tokens, and stop reason.

Serving

Admission, queueing, batching, routing, replica, rollout, overload, and deadline.

Tool and policy

Tool version, operation key, authorization, approval, dependency, side effect, and compensation.

Business outcome

Domain result, changed resource, customer-visible status, and accountable decision.

Evaluation

Quality, safety, task success, evidence completeness, cost, and failure classification.

Correlation

Use requestId and correlationId across application boundaries, traceId and spanId for execution structure, operation IDs for tools, policy-decision and approval IDs for authorization, and artifact and evidence IDs for durable review. For asynchronous messages, link spans across traces instead of forcing a parent-child relationship where a single parent would be misleading.
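The identifier roles above can be sketched as one immutable context object. This is a minimal illustration, not a prescribed schema: the class names, field names, and the `child`/`detached` helpers are assumptions introduced here, and the ID format (random hex) is arbitrary.

```python
import uuid
from dataclasses import dataclass, replace
from typing import Optional


def new_id(length: int = 16) -> str:
    """Random hex identifier (hypothetical format; uuid4 hex has 32 characters)."""
    return uuid.uuid4().hex[:length]


@dataclass(frozen=True)
class SpanLink:
    """Pointer to a span in another trace, used instead of a parent."""
    trace_id: str
    span_id: str


@dataclass(frozen=True)
class CorrelationContext:
    request_id: str                 # stable across application boundaries
    correlation_id: str             # stable across application boundaries
    trace_id: str                   # execution structure
    span_id: str                    # execution structure
    parent_span_id: Optional[str] = None
    operation_id: Optional[str] = None         # tool operation
    policy_decision_id: Optional[str] = None   # authorization
    approval_id: Optional[str] = None          # authorization
    evidence_id: Optional[str] = None          # durable review
    links: tuple = ()

    def child(self, **ids) -> "CorrelationContext":
        """New span in the same trace; all other identifiers carry over."""
        return replace(self, span_id=new_id(), parent_span_id=self.span_id,
                       links=(), **ids)

    def detached(self) -> "CorrelationContext":
        """New trace for an async consumer, linked back rather than parented."""
        return replace(self, trace_id=new_id(32), span_id=new_id(),
                       parent_span_id=None,
                       links=(SpanLink(self.trace_id, self.span_id),))
```

A tool call would use `ctx.child(operation_id=...)`, while a queue consumer that batches messages from several producers would start from `ctx.detached()` so that no single producer is presented as its parent.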

Metrics

Representative metrics by layer
Layer	Metrics
Compiler/engine	load, warmup, compile, prefill, decode, cache occupancy, allocation failure
Serving	queue, admission rejection, batch, route, time to first token (TTFT), time per output token (TPOT), goodput, rollout health
Distributed	collective, transfer, remote-cache hit, placement, worker failure
Agentic	step count, tool latency, approval wait, recovery, budget, evidence completion
Product	successful outcome, correction, abandonment, safe denial, incident
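As a worked example for the serving row, the sketch below derives TTFT, TPOT, and goodput from per-request timestamps. The definitions used are common ones but are assumptions here, not taken from this document: TTFT is first-token time minus arrival time, TPOT is the mean gap between output tokens after the first, and goodput is the rate of requests that meet both objectives. The `RequestTiming` type and the objective thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RequestTiming:
    arrival_s: float       # request accepted
    first_token_s: float   # first output token emitted
    last_token_s: float    # final output token emitted
    output_tokens: int


def ttft(t: RequestTiming) -> float:
    """Time to first token, in seconds (includes queueing and prefill)."""
    return t.first_token_s - t.arrival_s


def tpot(t: RequestTiming) -> float:
    """Mean time per output token after the first, in seconds."""
    if t.output_tokens < 2:
        return 0.0  # no inter-token gap to measure
    return (t.last_token_s - t.first_token_s) / (t.output_tokens - 1)


def goodput(timings: Iterable[RequestTiming], ttft_slo_s: float,
            tpot_slo_s: float, window_s: float) -> float:
    """Requests per second that met both latency objectives in the window."""
    ok = sum(1 for t in timings
             if ttft(t) <= ttft_slo_s and tpot(t) <= tpot_slo_s)
    return ok / window_s
```

Under these definitions, a replica can report high raw throughput while goodput falls, which is why the two are worth tracking separately.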

Logs and events

Use structured events with stable names, versions, UTC timestamps, severity, source layer, correlation, and sanitized attributes. Avoid free-form logging of prompts and tool payloads. Log changes to policy, route, model, tool, and configuration separately from request events.
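One way to enforce sanitized attributes is an allowlist applied when the event is built, so prompts and tool payloads never reach the log pipeline. The sketch below is illustrative only: the `build_event` function, the event field names, and the allowlist contents are assumptions, not a defined schema.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical allowlist: only these attribute keys may appear in events.
ALLOWED_ATTRIBUTES = {
    "model_version", "input_tokens", "output_tokens",
    "stop_reason", "tool_name", "tool_version",
}


def build_event(name: str, version: int, severity: str, layer: str,
                correlation: dict, attributes: dict,
                now: Optional[datetime] = None) -> dict:
    """Build a structured event, dropping attributes not on the allowlist."""
    now = now or datetime.now(timezone.utc)
    sanitized = {k: v for k, v in attributes.items() if k in ALLOWED_ATTRIBUTES}
    dropped = sorted(set(attributes) - ALLOWED_ATTRIBUTES)
    return {
        "name": name,                    # stable event name
        "version": version,              # event schema version
        "timestamp": now.isoformat(),    # UTC, ISO 8601
        "severity": severity,
        "layer": layer,                  # source layer
        "correlation": dict(correlation),
        "attributes": sanitized,
        "dropped_attributes": dropped,   # key names only, never values
    }
```

Recording only the names of dropped keys keeps a signal that a caller tried to log something restricted without storing the value itself.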

Traces

Trace model invocation, tool calls, policy decisions, waits, and recovery as distinct spans. Model spans should include deployment/version, input/output token counts, phase durations, cache indicators, and stop reason when available. OpenTelemetry’s generative-AI conventions are evolving; pin the convention version used by an implementation. [ar_cite id=”otel-genai” label=”OpenTelemetry”]
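The distinct span kinds can be illustrated with a small in-memory recorder. This sketch deliberately avoids any real tracing SDK: `SpanKind`, `Span`, `Recorder`, the attribute names, and the version placeholder are all hypothetical and do not follow the OpenTelemetry attribute naming.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from enum import Enum


class SpanKind(Enum):
    MODEL = "model"
    TOOL = "tool"
    POLICY = "policy"
    WAIT = "wait"
    RECOVERY = "recovery"


@dataclass
class Span:
    kind: SpanKind
    name: str
    attributes: dict = field(default_factory=dict)
    start_s: float = 0.0
    end_s: float = 0.0

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


class Recorder:
    def __init__(self, convention_version: str):
        # Pinned convention version, stored once for every span recorded here.
        self.convention_version = convention_version
        self.spans = []

    @contextmanager
    def span(self, kind: SpanKind, name: str, **attributes):
        s = Span(kind, name, dict(attributes), start_s=time.monotonic())
        try:
            yield s
        finally:
            # Close and record the span even if the wrapped work raises.
            s.end_s = time.monotonic()
            self.spans.append(s)


recorder = Recorder(convention_version="example-pinned-version")  # placeholder
with recorder.span(SpanKind.POLICY, "authorize_tool", decision="allow"):
    pass
with recorder.span(SpanKind.MODEL, "generate", deployment="example-deployment") as m:
    m.attributes.update(input_tokens=128, output_tokens=42, stop_reason="end_of_turn")
```

Keeping approval waits and recovery as their own kinds means their time is not silently attributed to the model or the tool.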

Evidence boundary

Observability may be sampled, short-lived, or operator-focused. Evidence is selected for durable review and may include artifact hashes, policy reasons, approvals, side-effect records, and failure history. Evidence should reference traces without requiring every trace payload to be retained.
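A durable evidence record might look like the sketch below: it stores a hash of the artifact and a trace identifier as a reference, not the trace payload itself. The `EvidenceRecord` type, its fields, and `make_evidence` are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvidenceRecord:
    evidence_id: str
    trace_id: str                 # reference only; the trace may expire
    artifact_sha256: str          # hash of the artifact, not its content
    policy_reason: str
    approval_id: Optional[str]
    side_effects: tuple           # e.g. identifiers of changed resources
    failure_history: tuple = ()


def make_evidence(evidence_id: str, trace_id: str, artifact: bytes,
                  policy_reason: str, approval_id: Optional[str] = None,
                  side_effects: tuple = (),
                  failure_history: tuple = ()) -> EvidenceRecord:
    """Select the durable subset of a request for later review."""
    return EvidenceRecord(
        evidence_id=evidence_id,
        trace_id=trace_id,
        artifact_sha256=hashlib.sha256(artifact).hexdigest(),
        policy_reason=policy_reason,
        approval_id=approval_id,
        side_effects=tuple(side_effects),
        failure_history=tuple(failure_history),
    )
```

Because the record carries only a trace ID, it remains reviewable after the sampled or short-lived trace has been deleted.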

Privacy and sampling

Keep raw prompt and completion capture off by default.
Separate restricted payload storage from broad metrics.
Apply tenant-aware access and deletion.
Retain denials, high-risk actions, errors, and evidence-required events even when routine traces are sampled.
Do not send search text, prompts, or contact content to analytics by default.
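The retention rule in the list above can be sketched as a sampling decision that always keeps the must-retain categories and samples the rest deterministically by trace ID. The event field names (`outcome`, `high_risk`, `evidence_required`, `trace_id`) and the hashing scheme are assumptions, not a defined contract.

```python
import hashlib

ALWAYS_KEEP_OUTCOMES = {"denied", "error"}  # hypothetical outcome labels


def keep_trace(event: dict, sample_rate: float) -> bool:
    """Decide whether to retain a trace.

    Denials, errors, high-risk actions, and evidence-required events are
    always kept; routine traces are kept with probability `sample_rate`.
    """
    if (event.get("outcome") in ALWAYS_KEEP_OUTCOMES
            or event.get("high_risk")
            or event.get("evidence_required")):
        return True
    # Hash the trace ID so every span of one trace gets the same decision.
    digest = hashlib.sha256(event["trace_id"].encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")       # 0 .. 2**64 - 1
    return bucket < int(sample_rate * 2**64)
```

Hashing the trace ID, instead of drawing a random number per span, avoids partially retained traces when several services make the decision independently.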

Operational views

Provide service-objective, capacity, model-quality, tool-reliability, approval-backlog, security-event, recovery, cost, and evidence-gap views. Each dashboard should link from aggregate metrics to minimized request evidence, subject to authorization.
