Runtime Observability

Practical AI runtime observability covering traces, metrics, logs, token timing, cache metrics, tool and policy events, cost attribution, evaluation, replay, profiling, and privacy.

Audience: Technical readers · Reading time: 6 minutes · Status: Production guidance · Last reviewed: 2026-06-21 UTC

Key takeaways

  • One trace should cross gateway, router, model server, engine, tools, memory, policy, evaluation, and response.
  • Metrics summarize fleet behavior; traces explain paths; logs record events; evaluations assess quality and outcome.
  • Queue, prefill, decode, transfer, tool, and policy time must be separated.
  • Model, runtime, prompt/template, tool, policy, and evaluator versions belong in evidence.
  • Sensitive prompts, credentials, tool results, personal data, and model internals require redaction and retention controls.

Runtime boundary

A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.

Receives

Trace context, runtime events, counters, timings, stable input/output references, versions, redaction policy, and evaluations.

Owns

Instrumentation schema, propagation, sampling, redaction, retention, correlation, export, and observability quality.

Emits

Traces, metrics, structured logs, profiles, dashboards, alerts, cost attribution, replay handles, and incident evidence.

Does not own

Authorization to store unrestricted sensitive content, and proof of correctness in the absence of evaluation.

Failure modes

Broken propagation, high-cardinality explosion, secret leakage, clock skew, sampling bias, missing versions, unbounded logs, and misleading dashboards.

Evidence and metrics

Trace completeness, dropped spans, export lag, cardinality, redaction failures, phase timings, cost, errors, evaluation coverage, and replay success.

Traces, metrics, logs, and evaluations

Traces represent one request, metrics aggregate fleet behavior, logs capture discrete events, and evaluations assess outputs or task outcomes.

Implementation

Use each signal for its proper purpose and correlate them with stable identifiers.

Operational implications

No single signal can explain prevalence, individual cause, and quality at once.

Measure

Trace completeness, metric freshness, log volume, evaluation coverage, and cross-signal correlation.

Trace schema and propagation

The root span carries request and contract identity; child spans identify component, operation, attempt, status, timing, versions, usage, and decisions.

Implementation

Use standard trace context across services and span links across durable resumes, async callbacks, or approvals.
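As a minimal sketch of cross-service propagation, the following parses and re-emits a W3C Trace Context `traceparent` header using only the standard library. It assumes version `00` of the header and rejects anything else, which is stricter than a production parser would be; the names `SpanContext`, `parse_traceparent`, and `child_traceparent` are illustrative, not part of any library.

```python
import re
import secrets
from typing import NamedTuple, Optional

# W3C Trace Context, version 00: version-traceid-parentid-flags in lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str  # 32 hex chars, shared by every span in the request
    span_id: str   # 16 hex chars, unique per span
    sampled: bool  # lowest bit of the trace-flags byte


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the caller's context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid under the specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: SpanContext) -> str:
    """Build the header for an outgoing call: same trace ID, fresh span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex chars
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
outgoing = child_traceparent(ctx)
```

A malformed header returns `None`, so the receiving service can start a new trace and count the event as a propagation failure instead of silently attaching spans to the wrong request.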

Operational implications

Keep baggage small and non-sensitive; store protected values by reference.

Measure

Propagation success, orphan spans, clock skew, dropped spans, and trace duration.

Phase timing

Actionable latency separates gateway, admission, queue, tokenization, retrieval, route, load/warmup, prefill, decode, tool, policy, memory, evaluation, and delivery.

Implementation

Adopt consistent span names and histogram boundaries by workload class.

Operational implications

One “AI latency” field hides the owner and cause of regressions.

Measure

p50/p95/p99 per phase, critical path, overlap, and SLO contribution.
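The per-phase percentiles above can be sketched as follows. This is an illustration over in-memory span records using nearest-rank percentiles; the span dictionary shape (`phase`, `duration_ms`) and the sample values are assumptions, and a real system would more likely read histograms from its metrics backend.

```python
import math
from collections import defaultdict


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]; values must be non-empty."""
    ordered = sorted(values)
    rank = math.ceil(q / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def phase_percentiles(spans, quantiles=(50, 95, 99)):
    """Group span durations by phase and report the requested percentiles."""
    by_phase = defaultdict(list)
    for span in spans:
        by_phase[span["phase"]].append(span["duration_ms"])
    return {
        phase: {f"p{q}": percentile(durations, q) for q in quantiles}
        for phase, durations in by_phase.items()
    }


# Hypothetical spans: queueing is steady, but one decode is very slow.
spans = [
    {"phase": "queue", "duration_ms": 12},
    {"phase": "queue", "duration_ms": 15},
    {"phase": "queue", "duration_ms": 14},
    {"phase": "queue", "duration_ms": 13},
    {"phase": "decode", "duration_ms": 300},
    {"phase": "decode", "duration_ms": 320},
    {"phase": "decode", "duration_ms": 310},
    {"phase": "decode", "duration_ms": 2400},
]
report = phase_percentiles(spans)
```

In this sample the queue tail stays at 15 ms while the decode tail reaches 2400 ms, which is the kind of ownership signal a single end-to-end latency number would hide.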

Token, cache, and scheduler metrics

Generation systems need metrics for prompt, cached, and output tokens; active tokens; cache blocks; batch composition; queueing; cache hit rate; evictions; and cleanup.

Implementation

Define token accounting and cache metrics consistently across runtimes and providers.

Operational implications

High hit rate can have little value if matched prefixes are short; active request count can hide token pressure.
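A small worked example makes the hit-rate caveat concrete. The sketch below compares a request-level hit rate with a token-weighted one; the record fields `prompt_tokens` and `cached_tokens` and the workload numbers are hypothetical.

```python
def cache_summary(requests):
    """Compare request-level hit rate with token-weighted prefill savings."""
    total = len(requests)
    hits = sum(1 for r in requests if r["cached_tokens"] > 0)
    prompt_tokens = sum(r["prompt_tokens"] for r in requests)
    cached_tokens = sum(r["cached_tokens"] for r in requests)
    return {
        # Share of requests that matched any cached prefix at all.
        "request_hit_rate": hits / total if total else 0.0,
        # Prompt tokens that did not need to be prefilled again.
        "prefill_tokens_avoided": cached_tokens,
        # Share of all prompt tokens served from cache.
        "token_hit_rate": cached_tokens / prompt_tokens if prompt_tokens else 0.0,
    }


# Hypothetical workload: every request matches a cached prefix,
# but only 32 tokens of each 4000-token prompt.
requests = [{"prompt_tokens": 4000, "cached_tokens": 32} for _ in range(10)]
summary = cache_summary(requests)
```

Here the request-level hit rate is 100 percent while less than 1 percent of prompt tokens are actually avoided, so the token-weighted figure is the one that tracks saved prefill work.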

Measure

Cached/prefill tokens avoided, active blocks, allocation failure, queue, Goodput, and cleanup.

Tool, policy, memory, and approval events

Agentic traces need tool proposal, validation, authorization, approval, execution, result validation, memory reads/writes, and policy effects.

Implementation

Record normalized schemas, stable resource refs, decision IDs, side-effect class, idempotency, and redaction state.

Operational implications

Do not log credentials, raw secrets, or uncontrolled personal data.
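One way to combine a normalized event schema with redaction is sketched below. The field names, the `SENSITIVE_KEYS` deny-list, and the `[REDACTED]` placeholder are assumptions for illustration; the sketch only inspects top-level argument names, so nested structures would need a recursive pass.

```python
import json
from dataclasses import asdict, dataclass

# Assumed deny-list of argument names that must never reach telemetry.
SENSITIVE_KEYS = frozenset({"password", "api_key", "authorization", "token", "secret"})


@dataclass(frozen=True)
class ToolEvent:
    trace_id: str
    tool: str
    tool_version: str
    decision_id: str        # links to the policy decision that allowed the call
    side_effect_class: str  # e.g. "read", "write", "external"
    idempotency_key: str
    arguments: dict         # redacted copy only
    redacted_fields: tuple  # names of the arguments that were withheld


def redact(arguments):
    """Replace sensitive top-level values and report which keys were removed."""
    clean, removed = {}, []
    for key, value in arguments.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
            removed.append(key)
        else:
            clean[key] = value
    return clean, tuple(sorted(removed))


def tool_event(trace_id, tool, tool_version, decision_id,
               side_effect_class, idempotency_key, arguments):
    """Build an event whose arguments have already passed through redaction."""
    clean, removed = redact(arguments)
    return ToolEvent(trace_id, tool, tool_version, decision_id,
                     side_effect_class, idempotency_key, clean, removed)


event = tool_event(
    "4bf92f3577b34da6a3ce929d0e0e4736", "file_reader", "1.4.0", "dec-001",
    "read", "idem-42", {"path": "/tmp/report", "api_key": "sk-123"},
)
log_line = json.dumps(asdict(event), sort_keys=True)
```

Recording `redacted_fields` alongside the event keeps the redaction state auditable without storing the withheld values.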

Measure

Tool success/retry, policy deny, approval time, memory write outcome, and side-effect verification.

Cost and attribution

Cost can include model/provider usage, accelerator time, external tools, storage, transfer, and evaluations.

Implementation

Use controlled tenant/product/model dimensions and reconcile estimates with billing.

Operational implications

Avoid user-controlled strings in metric labels; separate estimated from invoiced cost.
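The two points above, bounded label values and a separate invoiced figure, can be sketched as follows. The allow-lists, label names, and prices are hypothetical inputs, not real provider rates.

```python
# Assumed allow-lists; in practice these would come from configuration.
ALLOWED_MODELS = frozenset({"model-small", "model-large"})
ALLOWED_TENANT_TIERS = frozenset({"free", "pro", "enterprise"})


def metric_labels(model, tenant_tier):
    """Map raw values onto a bounded label set; unknowns collapse to 'other'."""
    return {
        "model": model if model in ALLOWED_MODELS else "other",
        "tenant_tier": tenant_tier if tenant_tier in ALLOWED_TENANT_TIERS else "other",
    }


def estimated_cost_usd(prompt_tokens, output_tokens,
                       price_per_1k_prompt, price_per_1k_output):
    """Estimate cost from token usage; prices are caller-supplied assumptions."""
    return (prompt_tokens / 1000 * price_per_1k_prompt
            + output_tokens / 1000 * price_per_1k_output)


def reconcile(estimated_total, invoiced_total):
    """Report the gap between telemetry estimates and the billing record."""
    variance = invoiced_total - estimated_total
    return {
        "estimated_usd": estimated_total,
        "invoiced_usd": invoiced_total,
        "variance_usd": variance,
        "variance_pct": variance / invoiced_total * 100 if invoiced_total else 0.0,
    }


labels = metric_labels("model-small", "user-supplied-string")
estimate = estimated_cost_usd(2000, 500, 0.5, 1.5)
gap = reconcile(90.0, 100.0)
```

Collapsing unknown values to `other` keeps label cardinality fixed regardless of what callers send, at the cost of losing detail that then has to live in traces or logs.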

Measure

Cost/request, cost/success, cost/model/tenant, variance, and unattributed cost.

Sampling and retention

Sampling controls cost and privacy but can bias incident and quality evidence.

Implementation

Use head sampling for predictable volume, tail sampling for errors/slow traces, and deterministic cohorts for evaluation.

Operational implications

Apply classification, access, retention, deletion, and export audit to telemetry systems.

Measure

Sample rate by class, retained errors, retention age, deletion success, and access audit.

Replay and incident analysis

Replay reconstructs contract, versions, references, state checkpoints, and control decisions.

Implementation

Preserve immutable version identifiers and protected evidence refs; publish redacted incident timelines.

Operational implications

A repeated stochastic model call is not identical replay, but control flow and side effects can be audited.
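The distinction between identical replay and auditable control flow can be sketched as a comparison that tolerates differing model output while flagging any change in decisions or other evidence. The `Step` shape and the step kinds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    kind: str        # e.g. "route", "policy", "tool_call", "model_call"
    decision: str    # the control decision taken at this step
    output_ref: str  # protected reference to stored evidence, not the content


def compare_replay(recorded, replayed):
    """List control-flow divergences; model output references may differ."""
    differences = []
    if len(recorded) != len(replayed):
        differences.append(f"step count {len(recorded)} != {len(replayed)}")
    for index, (old, new) in enumerate(zip(recorded, replayed)):
        if (old.kind, old.decision) != (new.kind, new.decision):
            differences.append(
                f"step {index}: {old.kind}/{old.decision} != {new.kind}/{new.decision}"
            )
        elif old.kind != "model_call" and old.output_ref != new.output_ref:
            differences.append(f"step {index}: evidence changed for {old.kind}")
    return differences


recorded = [
    Step("route", "model-large", "ref:1"),
    Step("model_call", "generate", "ref:2"),
    Step("policy", "allow", "ref:3"),
]
# Same decisions, but the stochastic model call produced different output.
replayed_same_flow = [
    Step("route", "model-large", "ref:1"),
    Step("model_call", "generate", "ref:9"),
    Step("policy", "allow", "ref:3"),
]
# The policy step now denies, which is a genuine control-flow divergence.
replayed_diverged = replayed_same_flow[:2] + [Step("policy", "deny", "ref:3")]

clean = compare_replay(recorded, replayed_same_flow)
diverged = compare_replay(recorded, replayed_diverged)
```

The first comparison reports no differences even though the model output reference changed; the second reports the policy divergence at step 2.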

Measure

Replay completeness, missing evidence, reproduction rate, time to diagnosis, and corrective action.

Low-level profiling

Kernel, CPU, memory, and network profiles explain hotspots beneath request spans.

Implementation

Correlate samples with trace/request IDs where possible and bound profiling overhead.

Operational implications

Use targeted profiling during controlled windows; always-on high-detail profiling can distort performance.

Measure

Kernel occupancy, CPU hotspots, allocation, bandwidth, network stalls, and profiling overhead.

Reference tables

Observability signals

| Signal | Best use | AI runtime examples | Risk |
| --- | --- | --- | --- |
| Trace | Explain one distributed request | Queue, prefill, decode, tool, policy, memory spans | Sensitive attributes/incomplete propagation |
| Metric | Fleet trends and SLOs | TTFT, cache hit, Goodput, GPU memory | High cardinality/misleading averages |
| Log | Discrete event detail | Model load, policy denial, eviction | Unstructured volume/secrets |
| Evaluation | Quality/outcome assessment | Citation quality, task success | Evaluator drift/unclear criteria |
| Profile | Low-level bottlenecks | Kernel, CPU, memory, network | Overhead/correlation difficulty |

Minimum trace fields

| Field group | Examples |
| --- | --- |
| Identity/contract | Trace ID, request ID, actor/tenant refs, contract version |
| Versions | Model, runtime, backend, prompt, tool, policy, evaluator |
| Timing | Queue, prefill, decode, tool, approval, evaluation, E2E |
| Usage | Prompt/cached/output tokens, bytes, cache blocks, cost |
| Decisions | Route, policy effect, approval, retry/fallback |
| Outcome | Status, finish reason, task evaluation, human handoff |
| Privacy | Classification, redaction, retention, evidence refs |
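
The field groups above can be turned into a trace completeness check. The sketch below assumes a nested record layout and a small required subset per group; both the layout and the specific field names (`queue_ms`, `contract_version`, and so on) are illustrative choices, not a published schema.

```python
# Assumed required subset per field group; adjust to the deployment's schema.
REQUIRED_FIELDS = {
    "identity": ("trace_id", "request_id", "contract_version"),
    "versions": ("model", "runtime", "policy"),
    "timing": ("queue_ms", "prefill_ms", "decode_ms"),
    "usage": ("prompt_tokens", "output_tokens"),
    "outcome": ("status", "finish_reason"),
    "privacy": ("classification", "redaction"),
}


def missing_fields(record):
    """Return 'group.field' names that are absent or None in a trace record."""
    missing = []
    for group, fields in REQUIRED_FIELDS.items():
        values = record.get(group) or {}
        for name in fields:
            if values.get(name) is None:
                missing.append(f"{group}.{name}")
    return missing


def trace_completeness(records):
    """Fraction of records with no missing required fields."""
    if not records:
        return 0.0
    complete = sum(1 for record in records if not missing_fields(record))
    return complete / len(records)


complete_record = {
    "identity": {"trace_id": "t1", "request_id": "r1", "contract_version": "v3"},
    "versions": {"model": "m-2", "runtime": "rt-1.8", "policy": "p-12"},
    "timing": {"queue_ms": 12, "prefill_ms": 80, "decode_ms": 310},
    "usage": {"prompt_tokens": 4000, "output_tokens": 250},
    "outcome": {"status": "ok", "finish_reason": "stop"},
    "privacy": {"classification": "internal", "redaction": "applied"},
}
# Same record with the policy version dropped, a common omission.
partial_record = dict(complete_record, versions={"model": "m-2", "runtime": "rt-1.8"})

gaps = missing_fields(partial_record)
score = trace_completeness([complete_record, partial_record])
```

Reporting the missing field names, not just the score, points directly at the component that failed to attach its version or timing.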

Decision checklist

  1. What root identifier crosses every component and async boundary?
  2. Which phase timings diagnose SLOs?
  3. Which versions must be attached to every result?
  4. What data is stored directly versus by protected reference?
  5. How do sampling and retention differ for failures and evaluations?
  6. How is high-cardinality metadata controlled?
  7. Can replay reconstruct decisions and state transitions?
  8. Who can access, export, and delete telemetry?

Common mistakes

  • Logging full prompts and tool results by default.
  • Using one average latency metric for the whole runtime.
  • Omitting model, runtime, tool, or policy versions.
  • Putting user-controlled strings into metric labels.
  • Sampling away rare policy failures or tail latency.
  • Assuming simple parent-child span relationships across durable workflows, where span links are needed.
  • Calling a reissued prompt a deterministic replay.

Sources and further reading

  1. OpenTelemetry concepts
     OpenTelemetry · Official documentation · accessed 2026-06-21 UTC

  2. Trace semantic conventions
     OpenTelemetry · Official specification · accessed 2026-06-21 UTC

  3. Generative AI semantic conventions
     OpenTelemetry · Official specification · accessed 2026-06-21 UTC

  4. Trace Context
     W3C · Standard · accessed 2026-06-21 UTC

  5. NIST Privacy Framework
     NIST · Government framework · accessed 2026-06-21 UTC

Last reviewed: 2026-06-21 UTC
