Runtime Observability - aRuntime.com

Key takeaways

One trace should cross gateway, router, model server, engine, tools, memory, policy, evaluation, and response.
Metrics summarize fleet behavior; traces explain paths; logs record events; evaluations assess quality and outcome.
Queue, prefill, decode, transfer, tool, and policy time must be measured and reported separately.
Model, runtime, prompt/template, tool, policy, and evaluator versions belong in evidence.
Sensitive prompts, credentials, tool results, personal data, and model internals require redaction and retention controls.

Runtime boundary

A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.

Receives

Trace context, runtime events, counters, timings, stable input/output references, versions, redaction policy, and evaluations.

Owns

Instrumentation schema, propagation, sampling, redaction, retention, correlation, export, and observability quality.

Emits

Traces, metrics, structured logs, profiles, dashboards, alerts, cost attribution, replay handles, and incident evidence.

Does not own

Authorization to store sensitive content without restriction, or proof of correctness in the absence of evaluation.

Failure modes

Broken propagation, high-cardinality explosion, secret leakage, clock skew, sampling bias, missing versions, unbounded logs, and misleading dashboards.

Evidence and metrics

Trace completeness, dropped spans, export lag, cardinality, redaction failures, phase timings, cost, errors, evaluation coverage, and replay success.

Traces, metrics, logs, and evaluations

Traces represent one request, metrics aggregate fleet behavior, logs capture discrete events, and evaluations assess outputs or task outcomes.

Implementation

Use each signal for its proper purpose and correlate them with stable identifiers.
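
As a minimal sketch of correlating signals by a stable identifier, the example below groups spans, log events, and evaluation results that share a trace ID. It assumes every record is a dictionary with a `trace_id` key; `RequestEvidence`, `correlate`, and the record fields are hypothetical names for illustration and do not come from any particular telemetry library.

```python
from dataclasses import dataclass, field


@dataclass
class RequestEvidence:
    """Everything known about one request, gathered from separate signal stores."""
    trace_id: str
    spans: list = field(default_factory=list)
    logs: list = field(default_factory=list)
    evaluations: list = field(default_factory=list)


def correlate(spans, logs, evaluations):
    """Group spans, log events, and evaluation results by their shared trace_id."""
    by_trace = {}
    sources = (("spans", spans), ("logs", logs), ("evaluations", evaluations))
    for kind, records in sources:
        for record in records:
            trace_id = record["trace_id"]
            evidence = by_trace.setdefault(trace_id, RequestEvidence(trace_id))
            getattr(evidence, kind).append(record)
    return by_trace


evidence = correlate(
    spans=[{"trace_id": "t1", "name": "prefill"}, {"trace_id": "t1", "name": "decode"}],
    logs=[{"trace_id": "t1", "event": "policy_denial"}],
    evaluations=[{"trace_id": "t2", "task_success": True}],
)
# Trace "t1" has spans and a log event but no evaluation record,
# which would show up as a gap in evaluation coverage.
```

Grouping on one identifier keeps each signal in its own store while still letting an investigator move from a fleet-level metric to the individual requests behind it.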

Operational implications

No single signal can explain prevalence, individual cause, and quality at once.

Measure

Trace completeness, metric freshness, log volume, evaluation coverage, and cross-signal correlation.

Trace schema and propagation

The root span carries the request and contract identity, and child spans identify the component, operation, attempt, status, timing, versions, usage, and decisions.

Implementation

Use standard trace context across services and span links across durable resumes, async callbacks, or approvals.
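
The sketch below illustrates propagation with the W3C Trace Context `traceparent` header, handling only the version `00` layout (`version-traceid-parentid-flags`). The function names and the `sample_new_traces` parameter are assumptions made for illustration; a production service would more likely rely on an existing tracing SDK than on hand-written parsing.

```python
import re
import secrets

# Version 00 layout: 00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>, lowercase.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header):
    """Return the parsed context, or None if the header is not valid version 00."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero trace and parent IDs are invalid per the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def outgoing_traceparent(incoming, sample_new_traces=False):
    """Build the header for a downstream call, continuing the trace when possible."""
    context = parse_traceparent(incoming) if incoming else None
    if context is None:
        # No usable context: start a new trace rather than guessing at parentage.
        trace_id = secrets.token_hex(16)
        sampled = sample_new_traces
    else:
        trace_id = context["trace_id"]
        sampled = context["sampled"]
    span_id = secrets.token_hex(8)  # this hop's span becomes the downstream parent
    # Only the sampled bit is carried forward in this sketch.
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
```

A service would call `outgoing_traceparent` with the header it received and attach the result to each downstream request. Durable resumes and approvals, where simple parentage does not hold, would record the earlier trace ID as a span link instead.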

Operational implications

Keep baggage small and non-sensitive; store protected values by reference.

Measure

Propagation success, orphan spans, clock skew, dropped spans, and trace duration.

Phase timing

Actionable latency data separates gateway, admission, queue, tokenization, retrieval, route, load/warmup, prefill, decode, tool, policy, memory, evaluation, and delivery time.

Implementation

Adopt consistent span names and histogram boundaries by workload class.

Operational implications

One “AI latency” field hides the owner and cause of regressions.

Measure

p50/p95/p99 per phase, critical path, overlap, and SLO contribution.
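
A minimal sketch of per-phase timing and percentile reporting follows. `PhaseTimer` and `percentile` are hypothetical helpers, and the nearest-rank percentile is one simple definition chosen for illustration; production systems usually derive percentiles from histograms rather than raw samples.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Collects durations per named phase so each phase can be reported on its own."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.samples = defaultdict(list)  # phase name -> list of durations in seconds

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the phase raised an exception.
            self.samples[name].append(self._clock() - start)


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample covering pct percent of the data."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


timer = PhaseTimer()
with timer.phase("prefill"):
    pass  # the real prefill work would run here
with timer.phase("decode"):
    pass  # the real decode work would run here

p95_by_phase = {name: percentile(durations, 95) for name, durations in timer.samples.items()}
```

Keeping one sample list per phase is what lets a regression be traced to its owner: a rise in the queue p99 and a rise in the decode p99 call for different fixes.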

Token, cache, and scheduler metrics

Generation systems need metrics for prompt, cached, and output tokens, active tokens, cache blocks, batch composition, queueing, cache hits, evictions, and cleanup.

Implementation

Define token accounting and cache metrics consistently across runtimes and providers.

Operational implications

A high cache hit rate can have little value if the matched prefixes are short, and the active request count can hide token pressure.
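
A small worked example makes the hit-rate point concrete. The `cache_summary` function and the request numbers are invented for illustration: nine of ten requests hit the prefix cache, yet each hit matches only 50 of 2,000 prompt tokens.

```python
def cache_summary(requests):
    """Summarize prefix-cache value from (prompt_tokens, cached_tokens) pairs."""
    requests = list(requests)
    if not requests:
        raise ValueError("at least one request is required")
    hits = sum(1 for _, cached in requests if cached > 0)
    prompt_tokens = sum(prompt for prompt, _ in requests)
    cached_tokens = sum(cached for _, cached in requests)
    return {
        "request_hit_rate": hits / len(requests),
        "prefill_tokens_avoided": cached_tokens,
        "token_fraction_avoided": cached_tokens / prompt_tokens,
    }


# Nine requests match a short 50-token prefix; one request misses entirely.
summary = cache_summary([(2000, 50)] * 9 + [(2000, 0)])
# request_hit_rate is 0.9, but only 450 of 20,000 prompt tokens (2.25%) were avoided.
```

Reporting tokens avoided next to the request-level hit rate shows whether the cache is actually reducing prefill work.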

Measure

Cached/prefill tokens avoided, active blocks, allocation failure, queue, Goodput, and cleanup.

Tool, policy, memory, and approval events

Agentic traces need tool proposal, validation, authorization, approval, execution, result validation, memory reads/writes, and policy effects.

Implementation

Record normalized schemas, stable resource refs, decision IDs, side-effect class, idempotency, and redaction state.
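
The sketch below shows one possible shape for a tool execution event that carries a decision ID, side-effect class, idempotency key, and redaction state. The field names, the `SENSITIVE_KEYS` set, and the key-name matching rule are all assumptions for illustration; real deployments would drive redaction from their own data classification policy.

```python
import json

# Hypothetical key names treated as sensitive; a real list comes from policy.
SENSITIVE_KEYS = frozenset({"password", "api_key", "authorization", "token", "secret"})


def redact(value, path=""):
    """Return (clean_copy, redacted_paths) for a JSON-like value."""
    if isinstance(value, dict):
        clean, hits = {}, []
        for key, item in value.items():
            child = f"{path}.{key}" if path else str(key)
            if str(key).lower() in SENSITIVE_KEYS:
                clean[key] = "[REDACTED]"
                hits.append(child)
            else:
                clean[key], sub = redact(item, child)
                hits.extend(sub)
        return clean, hits
    if isinstance(value, list):
        clean, hits = [], []
        for index, item in enumerate(value):
            cleaned, sub = redact(item, f"{path}[{index}]")
            clean.append(cleaned)
            hits.extend(sub)
        return clean, hits
    return value, []


def tool_event(trace_id, tool, arguments, decision_id, side_effect_class, idempotency_key):
    """Build a structured, redacted record of one tool execution."""
    clean, redacted_paths = redact(arguments)
    return {
        "trace_id": trace_id,
        "tool": tool,
        "arguments": clean,
        "decision_id": decision_id,
        "side_effect_class": side_effect_class,
        "idempotency_key": idempotency_key,
        "redaction": {
            "state": "redacted" if redacted_paths else "clean",
            "paths": redacted_paths,
        },
    }


event = tool_event(
    trace_id="t1",
    tool="http_get",
    arguments={"url": "https://example.test", "headers": {"Authorization": "Bearer abc"}},
    decision_id="dec-42",
    side_effect_class="read_only",
    idempotency_key="req-7",
)
print(json.dumps(event, indent=2))
```

Recording which paths were redacted, rather than silently dropping them, lets a reviewer confirm that redaction ran without exposing the protected values. Matching on key names alone will miss secrets embedded in free text, so this should be treated as one layer of control rather than a complete one.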

Operational implications

Do not log credentials, raw secrets, or uncontrolled personal data.

Measure

Tool success/retry, policy deny, approval time, memory write outcome, and side-effect verification.

Cost and attribution

Cost can include model/provider usage, accelerator time, external tools, storage, transfer, and evaluations.

Implementation

Use controlled tenant/product/model dimensions and reconcile estimates with billing.

Operational implications

Avoid user-controlled strings in metric labels; separate estimated from invoiced cost.

Measure

Cost/request, cost/success, cost/model/tenant, variance, and unattributed cost.
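
The following sketch aggregates estimated cost per tenant and model, using controlled label sets so that arbitrary input cannot create new metric series. The record fields, `controlled_label`, and `summarize_cost` are hypothetical names, and all values are estimates that would still need reconciling against invoices.

```python
from collections import defaultdict


def controlled_label(value, allowed, fallback="other"):
    """Map free-form input onto a fixed label set so cardinality stays bounded."""
    return value if value in allowed else fallback


def summarize_cost(records, known_tenants, known_models):
    """Aggregate estimated cost per (tenant, model) and track unattributed cost."""
    totals = defaultdict(lambda: {"requests": 0, "successes": 0, "estimated_cost": 0.0})
    unattributed = 0.0
    for record in records:
        if record.get("tenant") not in known_tenants:
            # Cost with no recognized owner is reported, not silently dropped.
            unattributed += record["estimated_cost"]
            continue
        key = (record["tenant"], controlled_label(record.get("model"), known_models))
        bucket = totals[key]
        bucket["requests"] += 1
        bucket["successes"] += 1 if record["success"] else 0
        bucket["estimated_cost"] += record["estimated_cost"]

    report = {}
    for key, bucket in totals.items():
        report[key] = {
            **bucket,
            "cost_per_request": bucket["estimated_cost"] / bucket["requests"],
            # Cost per success is undefined when nothing succeeded.
            "cost_per_success": (
                bucket["estimated_cost"] / bucket["successes"]
                if bucket["successes"]
                else None
            ),
        }
    return report, unattributed
```

Cost per success is kept separate from cost per request because retries and failed attempts still consume budget. A growing unattributed total usually points to missing tenant references in the trace.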

Sampling and retention

Sampling controls cost and privacy but can bias incident and quality evidence.

Implementation

Use head sampling for predictable volume, tail sampling for errors/slow traces, and deterministic cohorts for evaluation.
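
A minimal sketch of the three sampling styles follows. It assumes a hash of the trace or request ID is an acceptable source of determinism; the function names, salts, and trace fields are hypothetical. A real tail sampler also has to buffer spans until the trace completes, which is omitted here.

```python
import hashlib


def _below_threshold(key, salt, rate):
    """Deterministically map a key to [0, 2**64) and compare against the rate."""
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") < int(rate * 2**64)


def head_sample(trace_id, rate):
    """Decide at the start of a request; every service reaches the same answer."""
    return _below_threshold(trace_id, "head", rate)


def tail_keep(trace, slow_ms):
    """Decide after completion: always keep errors and slow traces."""
    return trace["status"] == "error" or trace["duration_ms"] >= slow_ms


def in_eval_cohort(request_id, fraction):
    """Stable evaluation cohort, independent of the head sampling decision."""
    return _below_threshold(request_id, "eval", fraction)


def keep_trace(trace, head_rate, slow_ms):
    """Retain a finished trace if either the head or the tail rule selects it."""
    return head_sample(trace["trace_id"], head_rate) or tail_keep(trace, slow_ms)
```

Using different salts for head sampling and the evaluation cohort keeps the two selections statistically independent, so evaluation results are not limited to whatever the head sampler happened to keep.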

Operational implications

Apply data classification, access control, retention, deletion, and export auditing to telemetry systems.

Measure

Sample rate by class, retained errors, retention age, deletion success, and access audit.

Replay and incident analysis

Replay reconstructs the contract, versions, references, state checkpoints, and control decisions behind a request.

Implementation

Preserve immutable version identifiers and protected evidence refs; publish redacted incident timelines.

Operational implications

A repeated stochastic model call is not identical replay, but control flow and side effects can be audited.

Measure

Replay completeness, missing evidence, reproduction rate, time to diagnosis, and corrective action.
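
Replay completeness can be approximated as the share of required evidence fields that are present on a stored record. The sketch below assumes a flat record and a hypothetical `REQUIRED_EVIDENCE` list; the actual required set depends on the runtime's contract.

```python
# Hypothetical field names standing in for the evidence a replay would need.
REQUIRED_EVIDENCE = (
    "contract_version",
    "model_version",
    "runtime_version",
    "policy_version",
    "input_ref",
    "state_checkpoints",
    "control_decisions",
)


def replay_completeness(record, required=REQUIRED_EVIDENCE):
    """Report which required evidence is missing or empty, and the fraction present."""
    missing = [name for name in required if record.get(name) in (None, "", [], {})]
    present = len(required) - len(missing)
    return {"completeness": present / len(required), "missing": missing}


result = replay_completeness(
    {
        "contract_version": "v3",
        "model_version": "m-2026-05",
        "runtime_version": "r-1.8",
        "policy_version": "p-12",
        "input_ref": "evidence://protected/abc",
        "state_checkpoints": [],
        "control_decisions": [{"route": "primary"}],
    }
)
# result["missing"] is ["state_checkpoints"] because the checkpoint list is empty.
```

A check like this describes whether control flow and side effects can be audited. It says nothing about whether reissuing the model call would produce the same output.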

Low-level profiling

Kernel, CPU, memory, and network profiles explain hotspots beneath request spans.

Implementation

Correlate samples with trace/request IDs where possible and bound profiling overhead.
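
One simple way to bound profiling overhead is to compare the same workload with and without the profiler and stop when the slowdown exceeds a budget. The functions and the 5% default budget below are illustrative assumptions, not recommended values.

```python
def profiling_overhead(baseline_seconds, profiled_seconds):
    """Fractional slowdown attributable to profiling (0.05 means 5% slower)."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return (profiled_seconds - baseline_seconds) / baseline_seconds


def within_overhead_budget(baseline_seconds, profiled_seconds, budget=0.05):
    """True if profiling may continue under the given overhead budget."""
    return profiling_overhead(baseline_seconds, profiled_seconds) <= budget
```

The baseline and profiled measurements should come from the same workload class, since overhead that is acceptable for long decode-heavy requests can dominate short ones.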

Operational implications

Use targeted profiling during controlled windows; always-on high-detail profiling can distort performance.

Measure

Kernel occupancy, CPU hotspots, allocation, bandwidth, network stalls, and profiling overhead.

Reference tables

Observability signals
Signal	Best use	AI runtime examples	Risk
Trace	Explain one distributed request	Queue, prefill, decode, tool, policy, memory spans	Sensitive attributes/incomplete propagation
Metric	Fleet trends and SLOs	TTFT, cache hit, Goodput, GPU memory	High cardinality/misleading averages
Log	Discrete event detail	Model load, policy denial, eviction	Unstructured volume/secrets
Evaluation	Quality/outcome assessment	Citation quality, task success	Evaluator drift/unclear criteria
Profile	Low-level bottlenecks	Kernel, CPU, memory, network	Overhead/correlation difficulty

Minimum trace fields
Field group	Examples
Identity/contract	Trace ID, request ID, actor/tenant refs, contract version
Versions	Model, runtime, backend, prompt, tool, policy, evaluator
Timing	Queue, prefill, decode, tool, approval, evaluation, E2E
Usage	Prompt/cached/output tokens, bytes, cache blocks, cost
Decisions	Route, policy effect, approval, retry/fallback
Outcome	Status, finish reason, task evaluation, human handoff
Privacy	Classification, redaction, retention, evidence refs

Decision checklist

What root identifier crosses every component and async boundary?
Which phase timings diagnose SLOs?
Which versions must be attached to every result?
What data is stored directly versus by protected reference?
How are sampling and retention different for failures/evaluations?
How is high-cardinality metadata controlled?
Can replay reconstruct decisions and state transitions?
Who can access, export, and delete telemetry?

Common mistakes

Logging full prompts and tool results by default.
Using one average latency metric for the whole runtime.
Omitting model, runtime, tool, or policy versions.
Putting user-controlled strings into metric labels.
Sampling away rare policy failures or tail latency.
Assuming simple parentage across durable workflows.
Calling a reissued prompt a deterministic replay.

Sources and further reading

OpenTelemetry concepts
OpenTelemetry · Official documentation · accessed 2026-06-21 UTC

Trace semantic conventions
OpenTelemetry · Official specification · accessed 2026-06-21 UTC

Generative AI semantic conventions
OpenTelemetry · Official specification · accessed 2026-06-21 UTC

Trace Context
W3C · Standard · accessed 2026-06-21 UTC

NIST Privacy Framework
NIST · Government framework · accessed 2026-06-21 UTC

Last reviewed: 2026-06-21 UTC
