Key takeaways
- One trace should cross gateway, router, model server, engine, tools, memory, policy, evaluation, and response.
- Metrics summarize fleet behavior; traces explain paths; logs record events; evaluations assess quality and outcome.
- Queue, prefill, decode, transfer, tool, and policy time must be separated.
- Model, runtime, prompt/template, tool, policy, and evaluator versions belong in evidence.
- Sensitive prompts, credentials, tool results, personal data, and model internals require redaction and retention controls.
Runtime boundary
A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.
Receives
Trace context, runtime events, counters, timings, stable input/output references, versions, redaction policy, and evaluations.
Owns
Instrumentation schema, propagation, sampling, redaction, retention, correlation, export, and observability quality.
Emits
Traces, metrics, structured logs, profiles, dashboards, alerts, cost attribution, replay handles, and incident evidence.
Does not own
Authorization to store unrestricted sensitive content, or proof that outputs are correct when no evaluation has been run.
Failure modes
Broken propagation, high-cardinality explosion, secret leakage, clock skew, sampling bias, missing versions, unbounded logs, and misleading dashboards.
Evidence and metrics
Trace completeness, dropped spans, export lag, cardinality, redaction failures, phase timings, cost, errors, evaluation coverage, and replay success.
Traces, metrics, logs, and evaluations
Traces represent one request, metrics aggregate fleet behavior, logs capture discrete events, and evaluations assess outputs or task outcomes.
Implementation
Use each signal for its proper purpose and correlate them with stable identifiers.
Operational implications
No single signal can explain prevalence, individual cause, and quality at once.
Measure
Trace completeness, metric freshness, log volume, evaluation coverage, and cross-signal correlation.
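Correlation by stable identifier can be sketched as a join on a shared trace ID. The record shapes, field names, and the `correlate` and `evaluation_coverage` helpers below are hypothetical, chosen only to illustrate the idea; a real system would read these records from its own trace, log, and evaluation stores.

```python
from collections import defaultdict

# Hypothetical records from three signal stores; field names are illustrative.
spans = [
    {"trace_id": "t1", "name": "prefill", "duration_ms": 120},
    {"trace_id": "t1", "name": "decode", "duration_ms": 900},
    {"trace_id": "t2", "name": "prefill", "duration_ms": 80},
]
logs = [{"trace_id": "t1", "event": "policy_denied"}]
evaluations = [{"trace_id": "t1", "task_success": False}]


def correlate(spans, logs, evaluations):
    """Group every record under its trace ID so one request can be read end to end."""
    joined = defaultdict(lambda: {"spans": [], "logs": [], "evaluations": []})
    sources = (("spans", spans), ("logs", logs), ("evaluations", evaluations))
    for kind, records in sources:
        for record in records:
            joined[record["trace_id"]][kind].append(record)
    return dict(joined)


def evaluation_coverage(joined):
    """Fraction of traces that have at least one evaluation attached."""
    if not joined:
        return 0.0
    covered = sum(1 for signals in joined.values() if signals["evaluations"])
    return covered / len(joined)


by_trace = correlate(spans, logs, evaluations)
coverage = evaluation_coverage(by_trace)
```

Here trace `t1` carries spans, a log event, and an evaluation, while `t2` has spans only, so coverage is one trace in two. Metrics are left out of the join on purpose: they aggregate across requests and are usually linked to traces through exemplars or shared dimensions rather than per-request keys.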
Trace schema and propagation
The root span carries request and contract identity, while child spans identify component, operation, attempt, status, timing, versions, usage, and decisions.
Implementation
Use standard trace context across services and span links across durable resumes, async callbacks, or approvals.
Operational implications
Keep baggage small and non-sensitive; store protected values by reference.
Measure
Propagation success, orphan spans, clock skew, dropped spans, and trace duration.
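Propagation across services can be illustrated with the W3C Trace Context `traceparent` header, which has the form `version-traceid-parentid-flags` in lowercase hex. The sketch below uses only the standard library and is not the OpenTelemetry SDK; the function names are hypothetical, and the choice to mark a newly started trace as sampled is an assumption.

```python
import re
import secrets

# traceparent: 2 hex version, 32 hex trace ID, 16 hex parent ID, 2 hex flags.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return (trace_id, parent_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero trace or parent IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def child_traceparent(header):
    """Continue the incoming trace with a fresh span ID, or start a new trace."""
    parsed = parse_traceparent(header) if header else None
    if parsed is None:
        # No usable context: start a new root (sampled=True is an assumption).
        trace_id, sampled = secrets.token_hex(16), True
    else:
        trace_id, _, sampled = parsed
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

The outgoing header keeps the trace ID and replaces the parent ID with the new span's ID. For a durable resume or approval callback, one option is to persist the original header and attach it to the resumed span as a link rather than as its parent.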
Phase timing
Actionable latency separates gateway, admission, queue, tokenization, retrieval, route, load/warmup, prefill, decode, tool, policy, memory, evaluation, and delivery.
Implementation
Adopt consistent span names and histogram boundaries by workload class.
Operational implications
One “AI latency” field hides the owner and cause of regressions.
Measure
p50/p95/p99 per phase, critical path, overlap, and SLO contribution.
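Per-phase timing and percentiles can be sketched with a context manager and `statistics.quantiles`. The `phase` helper, the in-memory `phase_durations_ms` store, and the phase names are hypothetical; a production system would record into spans or histograms instead of a Python list.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import quantiles

phase_durations_ms = defaultdict(list)  # phase name -> observed durations


@contextmanager
def phase(name, clock=time.perf_counter):
    """Record the duration of one named phase, even if the body raises."""
    start = clock()
    try:
        yield
    finally:
        phase_durations_ms[name].append((clock() - start) * 1000.0)


def percentiles(samples):
    """Return p50/p95/p99 using the 'inclusive' method of statistics.quantiles."""
    if len(samples) < 2:
        value = samples[0] if samples else 0.0
        return {"p50": value, "p95": value, "p99": value}
    cuts = quantiles(samples, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Usage: wrap each phase separately so regressions point at an owner.
with phase("queue"):
    pass
with phase("prefill"):
    pass

summary = {name: percentiles(values) for name, values in phase_durations_ms.items()}
```

Keeping one list (or histogram) per phase is what makes the per-phase percentiles possible. Note that percentiles do not add up: the p99 of end-to-end latency is generally not the sum of the per-phase p99 values, which is why critical path and overlap are measured separately.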
Token, cache, and scheduler metrics
Generation systems need counts of prompt, cached, and output tokens, plus active tokens, cache blocks, batch composition, queue depth, cache hit rate, evictions, and cleanup.
Implementation
Define token accounting and cache metrics consistently across runtimes and providers.
Operational implications
High hit rate can have little value if matched prefixes are short; active request count can hide token pressure.
Measure
Cached tokens and prefill tokens avoided, active blocks, allocation failures, queue depth, Goodput, and cleanup.
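The gap between a request-level hit rate and the prefill work actually avoided can be shown with a small calculation. The `RequestUsage` type and both helper functions are hypothetical names for this sketch, and the token counts are made-up illustrative values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestUsage:
    prompt_tokens: int  # total prompt tokens for the request
    cached_tokens: int  # prompt tokens served from the prefix cache


def request_hit_rate(usages):
    """Share of requests that matched any cached prefix at all."""
    if not usages:
        return 0.0
    return sum(1 for u in usages if u.cached_tokens > 0) / len(usages)


def token_hit_rate(usages):
    """Share of prompt tokens whose prefill was avoided."""
    total = sum(u.prompt_tokens for u in usages)
    if total == 0:
        return 0.0
    return sum(u.cached_tokens for u in usages) / total


# Three of four requests hit the cache, but each match is only 16 tokens long.
usages = [
    RequestUsage(prompt_tokens=2000, cached_tokens=16),
    RequestUsage(prompt_tokens=4000, cached_tokens=16),
    RequestUsage(prompt_tokens=1000, cached_tokens=0),
    RequestUsage(prompt_tokens=1000, cached_tokens=16),
]
by_request = request_hit_rate(usages)  # 3 of 4 requests
by_token = token_hit_rate(usages)      # 48 of 8000 prompt tokens
```

In this example the request-level hit rate is 75% while less than 1% of prompt tokens were served from cache, which is the situation the operational note warns about. Reporting both numbers side by side makes short matched prefixes visible.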
Tool, policy, memory, and approval events
Agentic traces need tool proposal, validation, authorization, approval, execution, result validation, memory reads/writes, and policy effects.
Implementation
Record normalized schemas, stable resource refs, decision IDs, side-effect class, idempotency, and redaction state.
Operational implications
Do not log credentials, raw secrets, or uncontrolled personal data.
Measure
Tool success/retry, policy deny, approval time, memory write outcome, and side-effect verification.
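A normalized, log-safe tool event might look like the sketch below. The field names, the `SENSITIVE_KEYS` deny list, the event name `tool.proposed`, and the way the idempotency key is derived are all assumptions made for illustration; a real deployment would take them from its own schema and redaction policy.

```python
import hashlib
import json

# Illustrative deny list; a real system would drive this from redaction policy.
SENSITIVE_KEYS = {"authorization", "api_key", "password", "token"}


def redact(arguments):
    """Mask sensitive top-level values and report whether anything was masked."""
    cleaned, redacted = {}, False
    for key, value in arguments.items():
        if key.lower() in SENSITIVE_KEYS:
            cleaned[key] = "[REDACTED]"
            redacted = True
        else:
            cleaned[key] = value
    return cleaned, redacted


def tool_event(trace_id, tool, tool_version, arguments, decision_id, side_effect_class):
    """Build one record for a proposed tool call, using redacted arguments only."""
    safe_args, redacted = redact(arguments)
    # Hash the redacted form so the key never depends on secret material.
    canonical = json.dumps({"tool": tool, "args": safe_args}, sort_keys=True)
    digest = hashlib.sha256(f"{trace_id}:{canonical}".encode()).hexdigest()
    return {
        "trace_id": trace_id,
        "event": "tool.proposed",
        "tool": tool,
        "tool_version": tool_version,
        "arguments": safe_args,
        "decision_id": decision_id,
        "side_effect_class": side_effect_class,
        "idempotency_key": digest[:16],
        "redaction_state": "redacted" if redacted else "clean",
    }


event = tool_event(
    trace_id="t1",
    tool="send_email",
    tool_version="1.2.0",
    arguments={"to": "ref:contact/42", "api_key": "sk-secret"},
    decision_id="dec-7",
    side_effect_class="external_write",
)
```

The recipient is stored as a stable reference (`ref:contact/42`) rather than an address, and the credential never reaches the record. This sketch only inspects top-level keys; nested arguments and free-text values would need a deeper redaction pass.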
Cost and attribution
Cost can include model/provider usage, accelerator time, external tools, storage, transfer, and evaluations.
Implementation
Use controlled tenant/product/model dimensions and reconcile estimates with billing.
Operational implications
Avoid user-controlled strings in metric labels; separate estimated from invoiced cost.
Measure
Cost/request, cost/success, cost/model/tenant, variance, and unattributed cost.
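Controlled dimensions and estimate-versus-invoice reconciliation can be sketched as follows. The tenant allow list, the model name, the prices, and the invoiced total are hypothetical values, not real provider rates.

```python
from collections import defaultdict

ALLOWED_TENANTS = {"tenant-a", "tenant-b"}  # hypothetical controlled vocabulary
# Hypothetical prices in USD per 1,000 tokens.
PRICE_PER_1K = {"model-small": {"prompt": 0.5, "output": 1.5}}


def tenant_label(raw):
    """Map arbitrary input onto a bounded label set to keep cardinality fixed."""
    return raw if raw in ALLOWED_TENANTS else "other"


def estimate_cost(model, prompt_tokens, output_tokens):
    """Return estimated USD cost, or None when the model has no known price."""
    price = PRICE_PER_1K.get(model)
    if price is None:
        return None
    return (prompt_tokens * price["prompt"] + output_tokens * price["output"]) / 1000.0


def attribute(requests):
    """Sum estimated cost per (tenant, model) and count requests left unpriced."""
    totals = defaultdict(float)
    unpriced = 0
    for request in requests:
        cost = estimate_cost(
            request["model"], request["prompt_tokens"], request["output_tokens"]
        )
        if cost is None:
            unpriced += 1
            continue
        totals[(tenant_label(request["tenant"]), request["model"])] += cost
    return dict(totals), unpriced


requests = [
    {"tenant": "tenant-a", "model": "model-small", "prompt_tokens": 2000, "output_tokens": 1000},
    {"tenant": "user-typed-string", "model": "model-small", "prompt_tokens": 4000, "output_tokens": 0},
    {"tenant": "tenant-a", "model": "model-unknown", "prompt_tokens": 10, "output_tokens": 10},
]
estimated, unpriced_requests = attribute(requests)
invoiced_total = 5.0  # hypothetical figure from a billing export
variance = invoiced_total - sum(estimated.values())
```

The free-form tenant string collapses into `other` instead of becoming a new label value, and the request for an unpriced model is counted rather than silently dropped. Keeping `estimated` and `invoiced_total` as separate quantities lets the variance be tracked over time.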
Sampling and retention
Sampling controls cost and privacy but can bias incident and quality evidence.
Implementation
Use head sampling for predictable volume, tail sampling for errors/slow traces, and deterministic cohorts for evaluation.
Operational implications
Apply classification, access, retention, deletion, and export audit to telemetry systems.
Measure
Sample rate by class, retained errors, retention age, deletion success, and access audit.
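The combination of head sampling, tail retention, and deterministic cohorts can be sketched with a hash of the trace ID. The function names, rates, and the 5,000 ms slow threshold are assumptions for illustration. For brevity the sketch evaluates all rules after the trace has finished, whereas a real head decision is made when the trace starts.

```python
import hashlib


def head_sampled(trace_id, rate):
    """Deterministic decision: the same trace ID always gets the same answer."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def keep_trace(trace_id, status, duration_ms, *,
               head_rate=0.05, slow_ms=5000, eval_cohort_rate=0.01):
    """Return the reason a finished trace is retained, or None to drop it."""
    if status != "ok":
        return "tail:error"  # always keep failures
    if duration_ms >= slow_ms:
        return "tail:slow"   # always keep slow traces
    # A salted hash gives an evaluation cohort independent of the head sample.
    if head_sampled("eval:" + trace_id, eval_cohort_rate):
        return "cohort:evaluation"
    if head_sampled(trace_id, head_rate):
        return "head"
    return None


decisions = [
    keep_trace("trace-1", "error", 120),
    keep_trace("trace-2", "ok", 9000),
]
```

Because the decision is a pure function of the trace ID, every service that sees the same ID reaches the same answer without coordination. Recording the retention reason alongside each trace makes it possible to report sample rate by class and to correct for sampling bias during analysis.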
Replay and incident analysis
Replay reconstructs contract, versions, references, state checkpoints, and control decisions.
Implementation
Preserve immutable version identifiers and protected evidence refs; publish redacted incident timelines.
Operational implications
A repeated stochastic model call is not an identical replay, but control flow and side effects can still be audited.
Measure
Replay completeness, missing evidence, reproduction rate, time to diagnosis, and corrective action.
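Replay auditing can be sketched as a comparison of recorded and replayed control decisions and versions, deliberately ignoring generated text. The record layout, the `replay_report` helper, and all identifiers in the example data are hypothetical.

```python
def replay_report(recorded, replayed, required_fields):
    """Compare a recorded run with a replay on evidence, versions, and decisions."""
    missing = [field for field in required_fields if field not in recorded]
    recorded_versions = recorded.get("versions", {})
    replayed_versions = replayed.get("versions", {})
    version_drift = {
        name: (version, replayed_versions.get(name))
        for name, version in recorded_versions.items()
        if replayed_versions.get(name) != version
    }
    completeness = 1.0
    if required_fields:
        completeness = 1.0 - len(missing) / len(required_fields)
    return {
        "completeness": completeness,
        "missing_evidence": missing,
        "version_drift": version_drift,
        "decisions_match": recorded.get("decisions") == replayed.get("decisions"),
    }


recorded = {
    "contract_version": "v3",
    "versions": {"model": "m-1", "policy": "p-9"},
    "decisions": [("route", "model-small"), ("policy", "allow"), ("tool", "send_email")],
}
replayed = {
    "contract_version": "v3",
    "versions": {"model": "m-1", "policy": "p-10"},
    "decisions": [("route", "model-small"), ("policy", "allow"), ("tool", "send_email")],
}
report = replay_report(
    recorded, replayed,
    required_fields=["contract_version", "versions", "decisions", "checkpoints"],
)
```

In this example the recorded run lacks state checkpoints, so completeness is three of four required fields, and the policy version has drifted between runs even though the decision sequence matches. Flagging drift explicitly helps avoid treating a replay under different versions as a reproduction.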
Low-level profiling
Kernel, CPU, memory, and network profiles explain hotspots beneath request spans.
Implementation
Correlate samples with trace/request IDs where possible and bound profiling overhead.
Operational implications
Use targeted profiling during controlled windows; always-on high-detail profiling can distort performance.
Measure
Kernel occupancy, CPU hotspots, allocation, bandwidth, network stalls, and profiling overhead.
Reference tables
| Signal | Best use | AI runtime examples | Risk |
|---|---|---|---|
| Trace | Explain one distributed request | Queue, prefill, decode, tool, policy, memory spans | Sensitive attributes/incomplete propagation |
| Metric | Fleet trends and SLOs | TTFT, cache hit, Goodput, GPU memory | High cardinality/misleading averages |
| Log | Discrete event detail | Model load, policy denial, eviction | Unstructured volume/secrets |
| Evaluation | Quality/outcome assessment | Citation quality, task success | Evaluator drift/unclear criteria |
| Profile | Low-level bottlenecks | Kernel, CPU, memory, network | Overhead/correlation difficulty |

| Field group | Examples |
|---|---|
| Identity/contract | Trace ID, request ID, actor/tenant refs, contract version |
| Versions | Model, runtime, backend, prompt, tool, policy, evaluator |
| Timing | Queue, prefill, decode, tool, approval, evaluation, E2E |
| Usage | Prompt/cached/output tokens, bytes, cache blocks, cost |
| Decisions | Route, policy effect, approval, retry/fallback |
| Outcome | Status, finish reason, task evaluation, human handoff |
| Privacy | Classification, redaction, retention, evidence refs |
Decision checklist
- What root identifier crosses every component and async boundary?
- Which phase timings diagnose SLOs?
- Which versions must be attached to every result?
- What data is stored directly versus by protected reference?
- How do sampling and retention differ for failures and evaluations?
- How is high-cardinality metadata controlled?
- Can replay reconstruct decisions and state transitions?
- Who can access, export, and delete telemetry?
Common mistakes
- Logging full prompts and tool results by default.
- Using one average latency metric for the whole runtime.
- Omitting model, runtime, tool, or policy versions.
- Putting user-controlled strings into metric labels.
- Sampling away rare policy failures or tail latency.
- Assuming simple parentage across durable workflows.
- Calling a reissued prompt a deterministic replay.
Sources and further reading
- OpenTelemetry concepts
- Trace semantic conventions
- Generative AI semantic conventions
- Trace Context
- NIST Privacy Framework
Last reviewed: 2026-06-21 UTC
