Key takeaways
- Runtime execution has two connected tracks: preparing a model program for a target environment and processing each production request through an operational control path.
- The model-preparation path resolves graph semantics, shapes, optimization, partitioning, code generation or kernel selection, packaging, loading, and memory planning before useful work can begin.
- The request path adds identity, context, routing, admission, queueing, batching, model execution, tool controls, validation, telemetry, memory, and response packaging around the model call.
- Failures must be classified by stage. Retrying a transient queue timeout is different from retrying an ambiguous side-effecting tool call or a deterministic schema violation.
- Reproducibility depends on more than a random seed. Artifact versions, compiler options, hardware, routing, context, tools, policy, and trace data must also be controlled.
Definition and scope
Two meanings of execution
“Execution” can describe two very different activities. The first is program preparation: translating or packaging a trained model so a runtime can load it and dispatch work to supported hardware. The second is request execution: converting an authorized user or system task into a scheduled model call, tool activity, policy decision, trace, and response.
Compiler documentation tends to emphasize graph semantics, intermediate representations, optimization, lowering, and target code. OpenXLA describes XLA as an ML compiler that optimizes linear algebra for execution speed and memory use, while StableHLO defines a portable high-level operation set between frameworks and compilers.[1][2] Runtime and serving documentation emphasizes loading, provider assignment, scheduling, batching, state, health, and telemetry. ONNX Runtime, for example, converts a model into an in-memory graph, performs provider-independent optimization, partitions supported subgraphs among execution providers, and executes the rewritten graph.[3]
Track A
Make the model executable
How does a framework model or model artifact become a runnable program for specific hardware?
Capture or import the model, represent it in an intermediate form, analyze shapes and types, optimize and partition the graph, lower or select kernels, package the result, allocate memory, load it, and warm the execution path.
Result: An executable graph, engine, module, runtime package, delegated subgraphs, model repository entry, or combination of these artifacts.
Track B
Handle a production request
How does an authorized task become a reliable, observable result?
Normalize identity and intent, assemble approved context, select a model route, admit and schedule the request, execute inference, broker tools, validate outputs, record policy and telemetry, update explicit memory, and package a response or human handoff.
Result: A structured result, evidence, warnings, tool outcomes, policy decisions, timing and cost data, memory changes, and a trace identifier.
The tracks meet at the loaded execution boundary
The preparation path determines what can execute, where it can execute, and which assumptions are fixed. The request path decides whether a specific task may execute now, which prepared path to use, how to coordinate surrounding work, and what evidence must be retained.
Track A
How a model becomes executable
A trained checkpoint is not automatically a portable executable. A deployment path must make operator semantics, inputs, outputs, shapes, types, control flow, target devices, memory behavior, and compatibility assumptions explicit. Depending on the stack, the result may be an interpreted graph, a set of delegated subgraphs, compiled machine code, a bytecode module, a backend engine, or a model package that combines several forms.
Product boundaries overlap. ONNX Runtime demonstrates a graph-runtime path that performs provider-independent optimization and then partitions supported subgraphs among ordered execution providers.[3] ExecuTorch demonstrates an ahead-of-time edge path in which a graph is lowered, operators can be converted to out variants, memory can be planned before serialization, and a small C++ runtime loads and executes the prepared program.[4][5]
Nine ordered stages transform a framework model or model artifact into a loaded and warmed execution path: capture, intermediate representation, analysis, optimization, partitioning, lowering, packaging, loading, and warmup.
- 01. Capture, export, or import: Separate deployment-relevant tensor computation from training-only behavior and unsupported host-language effects. Record operator, control-flow, parameter, tokenizer, and preprocessing assumptions.
- 02. Create and verify the intermediate representation: Build a typed graph or module whose operations have explicit semantics. Validate structural invariants, references, operator versions, and required metadata before optimization.
- 03. Analyze shapes, types, layouts, and constraints: Propagate tensor ranks, dimensions, element types, symbolic bounds, memory layouts, aliases, and device constraints. Create guards or specializations where dynamic values affect compilation.
- 04. Optimize the graph and memory behavior: Apply transformations such as constant folding, canonicalization, dead-code elimination, algebraic simplification, common-subexpression elimination, fusion, quantization rewrites, buffer reuse, and rematerialization where appropriate.
- 05. Partition and assign backends: Query backend capabilities, group supported subgraphs, decide placement, insert conversions at boundaries, and retain fallback paths for unsupported operations.
- 06. Lower, schedule, and generate or select code: Lower high-level operations into target-specific IR, library calls, generated kernels, or bytecode. Choose layouts, tiling, vectorization, parallelism, launch structure, and device-specific schedules.
- 07. Package and serialize the deployable artifact: Create a versioned package or model-repository entry. Record checksums, compatibility, input and output contracts, required libraries, quantization information, and provenance.
- 08. Load, allocate, and initialize runtime state: Verify integrity, map or copy weights, initialize execution providers, allocate tensor arenas or cache pools, register kernels, establish streams and communication groups, and create serving instances.
- 09. Warm the path and establish a baseline: Trigger lazy compilation, populate caches, establish memory pools, validate outputs against fixtures, and collect a baseline for latency, memory, and correctness before admitting live traffic.
Stage 01
Capture, export, or import
- Receives
- Framework program, exported graph, interchange format, or compiler-facing IR.
- Performs
- Separate deployment-relevant tensor computation from training-only behavior and unsupported host-language effects. Record operator, control-flow, parameter, tokenizer, and preprocessing assumptions.
- Emits
- A graph or program representation suitable for analysis and transformation.
- Typical failure
- Unsupported operator, hidden Python or framework side effect, incompatible model version, missing external data, or incomplete preprocessing contract.
- Record
- Exporter version, source framework, opset or dialect, warnings, unsupported-node inventory, and artifact digest.
Stage 02
Create and verify the intermediate representation
- Receives
- Captured model program plus weights, constants, metadata, and declared inputs.
- Performs
- Build a typed graph or module whose operations have explicit semantics. Validate structural invariants, references, operator versions, and required metadata before optimization.
- Emits
- A valid high-level IR with stable identifiers and explicit data dependencies.
- Typical failure
- Invalid graph, unresolved reference, unsupported dialect or opset, inconsistent metadata, or ambiguous control flow.
- Record
- IR version, node and parameter counts, control-flow inventory, validation result, and normalized model signature.
Stage 03
Analyze shapes, types, layouts, and constraints
- Receives
- Validated IR and representative or declared input constraints.
- Performs
- Propagate tensor ranks, dimensions, element types, symbolic bounds, memory layouts, aliases, and device constraints. Create guards or specializations where dynamic values affect compilation.
- Emits
- Annotated graph, symbolic constraints, specialization plan, and compatibility diagnostics.
- Typical failure
- Unsatisfied shape bound, unsupported dynamic rank, data-type mismatch, illegal layout, or specialization explosion.
- Record
- Shape profile, guard count, dynamic-dimension set, layout decisions, and specialization cache key.
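The guard idea in this stage can be illustrated with a small check that runs before a specialized path is used. This is a minimal sketch, not any compiler's API: `DimBound`, `check_guards`, and the `PROFILE` bounds are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DimBound:
    """Inclusive bound for one tensor dimension (hypothetical helper)."""
    name: str
    lo: int
    hi: int


def check_guards(shape: Sequence[int], bounds: Sequence[DimBound]) -> Optional[str]:
    """Return None if the shape satisfies every guard, else a diagnostic string."""
    if len(shape) != len(bounds):
        return f"rank mismatch: expected {len(bounds)}, got {len(shape)}"
    for size, bound in zip(shape, bounds):
        if not bound.lo <= size <= bound.hi:
            return f"{bound.name}={size} outside [{bound.lo}, {bound.hi}]"
    return None


# Assumed profile: dynamic batch and sequence length, static hidden size.
PROFILE = (DimBound("batch", 1, 8), DimBound("seq", 1, 2048), DimBound("hidden", 768, 768))

ok = check_guards((4, 512, 768), PROFILE)
too_long = check_guards((4, 4096, 768), PROFILE)
```

A failed guard is the point at which a runtime would pick a dynamic path, recompile, or reject the input; the diagnostic string is the kind of evidence worth recording.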
Stage 04
Optimize the graph and memory behavior
- Receives
- Analyzed IR and target-independent optimization policy.
- Performs
- Apply transformations such as constant folding, canonicalization, dead-code elimination, algebraic simplification, common-subexpression elimination, fusion, quantization rewrites, buffer reuse, and rematerialization where appropriate.
- Emits
- A semantically equivalent graph with reduced work, traffic, allocation, or launch overhead.
- Typical failure
- Numerical drift, invalid rewrite, excessive compile time, register or memory pressure, lost debuggability, or a transformation that is slower for the actual workload.
- Record
- Pass list, before-and-after graph statistics, numerical validation result, estimated memory, and optimization timing.
Stage 05
Partition and assign backends
- Receives
- Optimized graph plus an ordered set of execution providers, delegates, libraries, or device backends.
- Performs
- Query backend capabilities, group supported subgraphs, decide placement, insert conversions at boundaries, and retain fallback paths for unsupported operations.
- Emits
- A partitioned graph with explicit backend ownership and transfer boundaries.
- Typical failure
- Unsupported operation, excessive host-device transfers, incompatible memory ownership, fallback that violates latency goals, or provider initialization failure.
- Record
- Partition map, fallback node count, transfer edges, provider versions, and estimated copy volume.
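Priority-ordered assignment with a fallback can be sketched on a toy graph. The example below assumes a linear chain of operations and invented capability tables (`PROVIDERS`, `assign`, and `partition` are hypothetical names); real runtimes query backends per node and work on general graphs.

```python
from typing import List, Sequence, Set, Tuple

# Hypothetical capability tables, listed in priority order.
PROVIDERS: List[Tuple[str, Set[str]]] = [
    ("accelerator", {"MatMul", "Add", "Relu"}),
    ("cpu", {"MatMul", "Add", "Relu", "Softmax", "TopK"}),  # fallback provider
]


def assign(ops: Sequence[str]) -> List[str]:
    """Give each op to the first provider, in priority order, that supports it."""
    owners = []
    for op in ops:
        owner = next((name for name, caps in PROVIDERS if op in caps), None)
        if owner is None:
            raise ValueError(f"no provider supports {op}")
        owners.append(owner)
    return owners


def partition(ops: Sequence[str]) -> List[Tuple[str, List[str]]]:
    """Group consecutive ops that share an owner into one subgraph."""
    groups: List[Tuple[str, List[str]]] = []
    for op, owner in zip(ops, assign(ops)):
        if groups and groups[-1][0] == owner:
            groups[-1][1].append(op)
        else:
            groups.append((owner, [op]))
    return groups


graph = ["MatMul", "Add", "Relu", "TopK", "MatMul", "Softmax"]
parts = partition(graph)
transfer_edges = len(parts) - 1  # one boundary between adjacent partitions
```

In this toy graph, two unsupported operations split the chain into four partitions with three transfer boundaries, which is why fallback node count and transfer edges are worth recording together.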
Stage 06
Lower, schedule, and generate or select code
- Receives
- Backend-assigned subgraphs and target hardware description.
- Performs
- Lower high-level operations into target-specific IR, library calls, generated kernels, or bytecode. Choose layouts, tiling, vectorization, parallelism, launch structure, and device-specific schedules.
- Emits
- Compiled code, selected kernels, backend blobs, or runtime instructions tied to target capabilities.
- Typical failure
- Code-generation error, missing target feature, failed autotune, invalid binary, compiler crash, or performance regression.
- Record
- Target triple or device, compiler flags, code size, tuning record, kernel inventory, and build duration.
Stage 07
Package and serialize the deployable artifact
- Receives
- Generated modules, weights, metadata, tokenizer or preprocessing assets, and runtime requirements.
- Performs
- Create a versioned package or model-repository entry. Record checksums, compatibility, input and output contracts, required libraries, quantization information, and provenance.
- Emits
- An immutable, addressable deployment artifact with a documented compatibility envelope.
- Typical failure
- Missing dependency, mismatched weight and graph versions, incomplete metadata, corrupt serialization, or unsafe unsigned artifact.
- Record
- Artifact digest, size, manifest, dependency versions, signature status, and build provenance.
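A content-addressed manifest is one way to make the artifact digest concrete. The sketch below uses only the Python standard library; the file names, the `runtime` string, and the manifest layout are assumptions for illustration, and a digest on its own is not a signature.

```python
import hashlib
import json
from typing import Dict


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(files: Dict[str, bytes], runtime: str) -> dict:
    """Record per-file digests plus a package digest over the sorted entries."""
    entries = {name: sha256_hex(blob) for name, blob in sorted(files.items())}
    canonical = json.dumps(entries, sort_keys=True, separators=(",", ":")).encode()
    return {"runtime": runtime, "files": entries, "package_digest": sha256_hex(canonical)}


def verify(files: Dict[str, bytes], manifest: dict) -> bool:
    """Recompute digests, for example at load time, and compare with the manifest."""
    return build_manifest(files, manifest["runtime"]) == manifest


package = {"graph.bin": b"\x00graph", "weights.bin": b"\x01weights"}
manifest = build_manifest(package, runtime="example-runtime>=1.0")
tampered = {**package, "weights.bin": b"\x01weightz"}  # one byte changed
```

Calling `verify(package, manifest)` succeeds, while `verify(tampered, manifest)` fails because the recomputed weight digest no longer matches.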
Stage 08
Load, allocate, and initialize runtime state
- Receives
- Deployable artifact, runtime configuration, device inventory, and resource limits.
- Performs
- Verify integrity, map or copy weights, initialize execution providers, allocate tensor arenas or cache pools, register kernels, establish streams and communication groups, and create serving instances.
- Emits
- A loaded model or engine ready for readiness checks and controlled execution.
- Typical failure
- Insufficient memory, incompatible driver, provider mismatch, allocation fragmentation, corrupted artifact, or initialization timeout.
- Record
- Load time, resident memory, device allocation, cache capacity, instance count, and readiness state.
Stage 09
Warm the path and establish a baseline
- Receives
- Loaded runtime plus representative inputs and health policy.
- Performs
- Trigger lazy compilation, populate caches, establish memory pools, validate outputs against fixtures, and collect a baseline for latency, memory, and correctness before admitting live traffic.
- Emits
- A ready execution path with known cache state, validated output behavior, and baseline telemetry.
- Typical failure
- Warmup mismatch, delayed JIT failure, shape-cache miss, unstable latency, unavailable dependency, or failed health threshold.
- Record
- Warmup count, first-run and steady-state latency, compilation cache events, output comparison, and readiness decision.
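A warmup gate can be sketched as a loop that checks fixture outputs and keeps first-run latency separate from steady-state latency. Here `warm_up` and `toy_model` are hypothetical stand-ins for a readiness hook and a loaded engine, and the tolerance and iteration count are arbitrary.

```python
import time
from statistics import median
from typing import Callable, List, Sequence


def warm_up(run: Callable[[Sequence[float]], List[float]],
            fixture_in: Sequence[float], fixture_out: Sequence[float],
            iterations: int = 5, tol: float = 1e-6) -> dict:
    """Run a fixture repeatedly, compare outputs, and report a latency baseline."""
    latencies = []
    for _ in range(iterations):  # iterations must be at least 2
        start = time.perf_counter()
        out = run(fixture_in)
        latencies.append(time.perf_counter() - start)
        mismatch = len(out) != len(fixture_out) or any(
            abs(a - b) > tol for a, b in zip(out, fixture_out))
        if mismatch:
            return {"ready": False, "reason": "output mismatch"}
    return {"ready": True, "first_run_s": latencies[0],
            "steady_state_s": median(latencies[1:])}


def toy_model(xs: Sequence[float]) -> List[float]:
    """Stand-in for a loaded engine."""
    return [2.0 * x for x in xs]


report = warm_up(toy_model, [1.0, 2.0], [2.0, 4.0])
bad = warm_up(toy_model, [1.0, 2.0], [2.0, 5.0])
```

Only a report with `ready` set to true would allow the instance to receive live traffic; the first-run figure is where lazy compilation or cache construction tends to show up.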
Ahead of time
Move work out of the critical request path
Static analysis, memory planning, selective builds, code generation, and package validation can reduce first-request work and make constrained deployments more predictable. ExecuTorch explicitly plans mutable tensor placement in fixed-size arenas before emitting the runtime program.[5]
At load or first use
Retain flexibility where the workload requires it
Dynamic shapes, provider discovery, runtime partitioning, lazy compilation, autotuning, and cache construction may happen when a model loads or a new input profile appears. These paths need bounded concurrency, persistent caches, readiness checks, and observable fallbacks.
Track B
How a production request is executed
A model call is only one span in a production request. The runtime must know who is asking, what authority is delegated, which data may be retrieved, which route satisfies privacy and service objectives, how work enters constrained capacity, whether a tool action is permitted, what output contract applies, and how the result can be investigated later.
Typed execution boundaries reduce ambiguity. OpenAPI provides a language-neutral description of HTTP interfaces, while JSON Schema supplies a vocabulary for validating JSON structure and constraints.[12][13] MCP standardizes connections between LLM applications and external data sources or tools, but its tool metadata and transport authorization are inputs to runtime governance rather than substitutes for product authorization and policy.[14][15][16]
The client submits a request to a boundary. The runtime authenticates it, assembles context, invokes an inference engine, optionally brokers a tool, validates and traces the result, then returns a response, continuation, or human-review state.
- 1. Client → Boundary: Submit versioned request with identity, task, deadline, and output contract.
- 2. Boundary → Runtime: Authenticate, authorize, normalize, classify risk, and start the trace.
- 3. Runtime → Context: Retrieve approved data and return provenance-aware context.
- 4. Runtime → Inference: Select route, admit, schedule, batch, execute prefill, and begin decode or prediction.
- 5. Inference → Runtime: Return tokens, prediction, structured candidate, usage, and engine events.
- 6. Runtime → Tool broker: Validate and authorize any typed tool request before execution.
- 7. Tool broker → Runtime: Return typed result, side-effect status, evidence, and audit fields.
- 8. Runtime → Trust and trace: Validate output, record policy, evaluation, timing, cost, and approved memory changes.
- 9. Runtime → Client: Return final response, stream completion, continuation token, or human-review state.
Request boundary and identity
- Receives
- Actor, tenant, session, task, input payload, deadline, budget, requested output, and transport metadata.
- Runtime work
- Authenticate the caller, normalize the task, assign a request and trace identifier, establish delegated authority, classify risk, and reject malformed or unauthorized work early.
- Emits
- A versioned internal request envelope with explicit authority and service objectives.
- Controls
- Authentication, tenant isolation, quotas, input-size limits, rate limits, schema validation, and replay protection.
Context assembly
- Receives
- Normalized task, context policy, conversation state, approved data sources, and retrieval constraints.
- Runtime work
- Retrieve, rank, filter, redact, deduplicate, compress, and cite the smallest useful context set. Preserve provenance and keep untrusted content distinguishable from instructions.
- Emits
- A bounded context package with source references, classifications, freshness, and redaction status.
- Controls
- Data classification, source allowlists, authorization-aware retrieval, prompt-injection treatment, token budget, and retention rules.
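Selecting "the smallest useful context set" can be approximated by a greedy pack under a token budget. The sketch assumes that relevance scores and authorization flags arrive from upstream components and that a whitespace split is an acceptable token estimate; `Passage` and `assemble` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Passage:
    source: str    # provenance reference kept with the text
    text: str
    score: float   # relevance from an upstream ranker (assumed)
    allowed: bool  # result of an authorization-aware filter (assumed)


def assemble(passages: List[Passage], token_budget: int) -> List[Passage]:
    """Greedy pack: best-scoring authorized, deduplicated passages that fit."""
    chosen, seen, used = [], set(), 0
    for p in sorted(passages, key=lambda p: p.score, reverse=True):
        cost = len(p.text.split())  # crude whitespace token estimate
        if not p.allowed or p.text in seen or used + cost > token_budget:
            continue
        chosen.append(p)
        seen.add(p.text)
        used += cost
    return chosen


candidates = [
    Passage("kb/12", "refund window is thirty days", 0.9, True),
    Passage("kb/12-copy", "refund window is thirty days", 0.8, True),
    Passage("hr/7", "salary bands for staff", 0.7, False),
    Passage("kb/31", "refunds need an order number", 0.6, True),
]
context = assemble(candidates, token_budget=10)
```

The duplicate and the unauthorized passage are dropped before the budget is spent, and each surviving passage keeps its source reference for citation.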
Model route and admission
- Receives
- Task requirements, context size, risk, privacy boundary, capability needs, latency target, provider state, and budget.
- Runtime work
- Choose a local, hosted, specialist, fallback, or multimodel route. Decide whether the request can enter the system now or must wait, degrade, redirect, or fail fast.
- Emits
- A route decision, execution profile, priority, admission result, and fallback plan.
- Controls
- Capability policy, residency rules, cost ceiling, provider health, concurrency quotas, circuit breakers, and deterministic fallbacks.
Queueing, scheduling, and batching
- Receives
- Admitted request, route, priority, token or tensor estimates, cache hints, and deadline.
- Runtime work
- Place the request in the appropriate queue, combine compatible work, reserve memory, enforce fairness, support cancellation, and apply backpressure when capacity is constrained.
- Emits
- A scheduled execution unit with batch membership, reserved resources, and queue timing.
- Controls
- Queue limits, priority policy, deadline handling, batch compatibility, admission watermarks, cancellation, and preemption rules.
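Batch formation under a deadline and a token budget can be sketched as follows. The policy shown (earliest deadline first, one model per batch, expired work removed before batching) is an assumption for illustration rather than any particular scheduler's behavior, and `Request` and `next_batch` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    rid: str
    model: str
    tokens: int
    deadline_ms: int


def next_batch(queue: List[Request], model: str, token_budget: int, now_ms: int) -> List[Request]:
    """Remove and return an earliest-deadline-first batch for one model."""
    # Drop expired work; a real runtime would return a classified timeout status.
    for r in [r for r in queue if r.deadline_ms <= now_ms]:
        queue.remove(r)
    batch, used = [], 0
    for r in sorted(queue, key=lambda r: r.deadline_ms):
        if r.model == model and used + r.tokens <= token_budget:
            batch.append(r)
            used += r.tokens
    for r in batch:
        queue.remove(r)
    return batch


queue = [
    Request("a", "m1", 300, 900), Request("b", "m1", 600, 500),
    Request("c", "m2", 100, 400), Request("d", "m1", 200, 50),
    Request("e", "m1", 200, 700),
]
batch = next_batch(queue, "m1", token_budget=1000, now_ms=100)
```

Request `d` has already missed its deadline, `c` targets a different model, and `a` would exceed the token budget, so the batch contains `b` and `e` while `a` and `c` stay queued.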
Prefill or initial model execution
- Receives
- Scheduled inputs, model state, prompt or tensors, cache references, and execution parameters.
- Runtime work
- Run the initial forward computation. For autoregressive language models this normally processes the input sequence and creates the initial KV-cache state; for other models it may complete the full inference in one execution.
- Emits
- First-token state, model outputs, logits, embeddings, predictions, or intermediate state for continued generation.
- Controls
- Maximum input size, precision, device placement, timeout, memory reservation, and provider-specific execution limits.
Decode, stream, or iterate
- Receives
- Current generation state, cache state, decoding policy, cancellation signal, and remaining budget.
- Runtime work
- Generate additional tokens or iterative outputs, update cache state, stream partial results, apply stop conditions, and interleave other admitted work according to scheduler policy.
- Emits
- Token or event stream, completed model output, usage counters, and updated execution state.
- Controls
- Sampling configuration, structured-generation constraints, stop rules, maximum output, stream backpressure, and cancellation.
Tool calls and structured actions
- Receives
- Model-proposed tool name and arguments, current authority, tool contract, state, and approval policy.
- Runtime work
- Validate arguments, resolve credentials outside the model context, authorize the action, enforce timeout and idempotency rules, execute in an isolated adapter, and classify the result or side effect.
- Emits
- Typed tool result, side-effect record, evidence, error envelope, and approval state.
- Controls
- Schema validation, deterministic authorization, allowlists, egress policy, rate limits, sandboxing, human approval, idempotency keys, and audit logging.
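The interaction between an allowlist, an argument contract, and an idempotency key can be shown with a toy broker. `ToolBroker`, `create_ticket`, and the tool names are hypothetical, and the in-memory result store stands in for the durable store a real broker would need.

```python
from typing import Any, Callable, Dict, List, Set, Tuple


class ToolBroker:
    """Toy broker: allowlist, argument-name check, and idempotency-key replay."""

    def __init__(self, tools: Dict[str, Tuple[Callable[..., Any], Set[str]]]):
        self.tools = tools                     # name -> (handler, required argument names)
        self.completed: Dict[str, Any] = {}    # idempotency key -> stored result
        self.executions = 0

    def call(self, name: str, args: dict, idempotency_key: str) -> dict:
        if name not in self.tools:
            return {"status": "rejected", "reason": "tool not allowlisted"}
        handler, required = self.tools[name]
        if set(args) != required:
            return {"status": "rejected", "reason": "arguments do not match contract"}
        if idempotency_key in self.completed:  # a retry must not repeat the side effect
            return {"status": "replayed", "result": self.completed[idempotency_key]}
        result = handler(**args)
        self.executions += 1
        self.completed[idempotency_key] = result
        return {"status": "executed", "result": result}


ledger: List[str] = []


def create_ticket(title: str) -> dict:
    ledger.append(title)                       # the side effect
    return {"ticket": len(ledger)}


broker = ToolBroker({"ticket.create": (create_ticket, {"title"})})
first = broker.call("ticket.create", {"title": "Reset password"}, "key-1")
retry = broker.call("ticket.create", {"title": "Reset password"}, "key-1")
denied = broker.call("shell.exec", {"cmd": "ls"}, "key-2")
```

The retry returns the stored result without creating a second ticket. The sketch does not cover a crash between executing the handler and recording the result, which is the ambiguous case that requires reconciliation against external state.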
Output validation and policy
- Receives
- Model output, tool results, evidence, product rules, policy state, and output contract.
- Runtime work
- Validate structure, provenance, safety, completeness, citations, business invariants, and permissions. Repair only when the repair path is explicit and bounded; otherwise fail or request human review.
- Emits
- Accepted output, warning, policy rejection, repair request, or human-review task.
- Controls
- JSON or domain schema, policy decision points, citation checks, confidence or evaluation thresholds, and irreversible-action gates.
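An explicit, bounded repair path might look like the following. The output contract (`REQUIRED`), the `repair` callback, and the status strings are assumptions for this sketch; a production validator would typically use a full JSON Schema implementation rather than hand-written type checks.

```python
import json
from typing import Callable, Optional

REQUIRED = {"answer": str, "citations": list}  # assumed output contract


def violation(raw: str) -> Optional[str]:
    """Return a description of the first contract violation, or None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return "top-level value must be an object"
    for key, kind in REQUIRED.items():
        if not isinstance(data.get(key), kind):
            return f"field '{key}' missing or not {kind.__name__}"
    return None


def validate_with_repair(raw: str, repair: Callable[[str, str], str],
                         max_repairs: int = 1) -> dict:
    """Accept valid output; otherwise attempt a bounded number of repairs, then escalate."""
    problem = violation(raw)
    repairs = 0
    while problem is not None and repairs < max_repairs:
        raw = repair(raw, problem)             # e.g. one more constrained model call
        repairs += 1
        problem = violation(raw)
    if problem is None:
        return {"status": "accepted", "output": json.loads(raw), "repairs": repairs}
    return {"status": "human_review", "reason": problem, "repairs": repairs}


def fake_repair(raw: str, problem: str) -> str:
    """Stand-in for a repair call that returns a conforming document."""
    return '{"answer": "30 days", "citations": ["kb/12"]}'


fixed = validate_with_repair('{"answer": "30 days"}', fake_repair)
stuck = validate_with_repair("not json", lambda raw, problem: raw)
```

The loop cannot run more than `max_repairs` times, so an output that never becomes valid ends in a human-review status instead of an unbounded retry.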
Telemetry, evaluation, and memory
- Receives
- Stage events, timing, token or byte counts, route and tool decisions, policy results, evaluation outputs, and candidate memory changes.
- Runtime work
- Emit correlated spans, metrics, logs, cost records, and evaluation data. Apply redaction and retention rules. Persist only explicit, policy-approved memory with provenance and deletion semantics.
- Emits
- Trace, metrics, audit records, evaluation result, cost summary, and approved memory mutations.
- Controls
- Sensitive-data redaction, telemetry sampling, retention, memory-write policy, access control, and incident evidence requirements.
Response, continuation, or human handoff
- Receives
- Validated result, evidence, warnings, trace summary, memory result, and review state.
- Runtime work
- Package the response for the caller, finalize streaming, expose citations and warnings, return a continuation token for asynchronous work, or transfer the task to a human with sufficient context to act.
- Emits
- Versioned response envelope, final status, trace identifier, continuation or review reference, and user-visible outcome.
- Controls
- Output minimization, disclosure policy, response schema, resumability, final authorization check, and delivery guarantees.
Inference mechanics
Prefill, decode, scheduling, and model-serving boundaries
Autoregressive language-model serving makes the request path especially visible. The initial prompt work creates or extends cache state and produces the first output state; subsequent decode work advances generation incrementally. A scheduler must share device capacity among prompts, active decodes, cache reservations, priorities, deadlines, and cancellations.
vLLM documents PagedAttention-based KV-cache management, continuous batching, prefix caching, streaming, quantization, and distributed execution.[7] Its chunked-prefill guidance describes a policy that prioritizes active decode work and fits prompt chunks into the remaining token budget, exposing an explicit trade-off among throughput, time to first token, and inter-token latency.[8]
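The decode-first policy described above can be reduced to a per-step token budget. The function below is a simplified illustration of that idea, not vLLM's scheduler code; `plan_step`, the request identifiers, and the budget value are hypothetical.

```python
from typing import Dict, List, Tuple


def plan_step(decodes: List[str], pending_prefill: Dict[str, int],
              token_budget: int) -> Tuple[List[str], Dict[str, int]]:
    """Plan one step: one token per active decode first, then prompt chunks."""
    scheduled_decodes = decodes[:token_budget]
    remaining = token_budget - len(scheduled_decodes)
    chunks: Dict[str, int] = {}
    for rid, tokens_left in pending_prefill.items():  # dict keeps arrival order
        if remaining == 0:
            break
        take = min(tokens_left, remaining)
        chunks[rid] = take
        remaining -= take
    return scheduled_decodes, chunks


step_decodes, step_chunks = plan_step(
    ["d1", "d2", "d3"], {"p1": 700, "p2": 400}, token_budget=512)
```

With three active decodes and a budget of 512 tokens, the first prompt receives a 509-token chunk and the second waits for a later step. A smaller budget protects inter-token latency for running decodes at the cost of a longer time to first token for new prompts.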
A model server adds a distinct operating boundary. Triton receives inference requests through HTTP, gRPC, or a C API, routes them to per-model schedulers, optionally batches them, and invokes the configured backend.[9] Its model-execution documentation distinguishes stateless scheduling from sequence-aware stateful scheduling and allows multiple model instances to execute concurrently.[10]
Queue
Waiting is runtime work
Queue time reveals capacity pressure, fairness, and admission quality. Record it separately from model time; a fast engine behind an overloaded queue is still a slow service.
Prefill
Process input and establish state
Prompt length, cache reuse, batching, device memory, and chunking affect time to first token and the capacity available for active decodes.
Decode
Advance generation incrementally
Inter-token latency, scheduling cadence, stream delivery, stop rules, and cache pressure shape the interactive experience.
Serving
Operate engines as a service
Protocols, health, model versions, schedulers, instances, metrics, rollouts, and autoscaling belong to the serving layer, not to model mathematics alone.
Do not collapse engine metrics into user latency
Triton distinguishes request counts, inference counts, execution counts, failures, pending requests, and latency-related metrics.[11] A production trace should additionally separate boundary, context, route, queue, tool, validation, memory, and response-packaging time.
Reliability
Failure handling by stage and retry safety
“Retry on error” is not a runtime strategy. The runtime must identify the failed stage, determine whether the operation is transient, know whether any side effect may have completed, preserve the original deadline and authority, and record each attempt. Preparation failures should normally block readiness; request failures should return a classified status, invoke a bounded fallback, resume from a durable checkpoint, or escalate to a human.
| Failure class | Signal | Retry rule | Containment | Required evidence |
|---|---|---|---|---|
| Artifact or compatibility failure | Model cannot be parsed, validated, loaded, or matched to the runtime and device. | Do not retry unchanged input. Select a compatible artifact or runtime version. | Keep the previous known-good artifact active; fail deployment readiness rather than live requests. | Artifact digest, model and runtime versions, operator or dialect details, provider logs, and validation output. |
| Shape, guard, or specialization miss | Input violates compiled assumptions or triggers an unavailable specialization. | Retry only through an explicit dynamic path, recompilation path, or compatible profile. | Bound compilation concurrency and cache growth; reject hostile or unbounded shape variation. | Input signature, guard expression, cache key, compile event, and selected fallback. |
| Capacity or admission failure | Queue, memory, concurrency, device, or provider capacity cannot meet the request objective. | Retry with jitter only when the deadline and idempotency policy permit it; otherwise degrade or fail fast. | Admission limits, circuit breakers, backpressure, queue caps, load shedding, and alternate routes. | Queue depth, reservation failure, memory pressure, provider health, deadline, and route decision. |
| Timeout, cancellation, or partial stream | The request exceeds its deadline, the caller cancels, or a stream terminates before completion. | Retry from a documented continuation point when supported; do not silently duplicate completed side effects. | Propagate cancellation, release cache and reservations, terminate downstream work, and mark the response incomplete. | Deadline, cancellation origin, last completed span, emitted output count, and released resources. |
| Model or provider execution failure | Kernel, backend, network provider, or model process returns an error or unhealthy result. | Use a bounded retry or fallback only for classified transient failures and compatible output contracts. | Circuit-break the failing route, isolate the instance, and avoid cross-tenant state leakage during recovery. | Provider and model identifiers, attempt, error class, device state, fallback, and output completeness. |
| Tool failure or ambiguous side effect | A tool times out, returns an invalid result, or may have committed a write without a definitive response. | Retry only with a stable idempotency key or a verified read-before-retry strategy. | Suspend the workflow, reconcile external state, and require approval when ambiguity affects money, access, or records. | Tool version, arguments reference, idempotency key, authorization decision, attempt history, and side-effect status. |
| Validation or policy rejection | Output violates schema, evidence, safety, permission, or business rules. | Use a bounded repair path only when the violation is repairable and the policy permits another model call. | Do not expose or execute rejected content; route to a safer fallback or human review. | Validator and policy versions, failed rule, redacted sample, repair attempts, and final disposition. |
| Telemetry or memory persistence failure | Required audit, trace, evaluation, or memory write cannot be committed. | Retry according to durability requirements; distinguish optional analytics from mandatory audit evidence. | Fail closed for mandatory audit paths, queue durable outbox events, and avoid acknowledging memory that was not stored. | Record type, storage target, durability class, retry state, redaction state, and user-visible warning. |
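
A stage-aware retry decision can be encoded as a small pure function. The sketch covers four of the failure classes in the table; the enum members, action strings, and thresholds are hypothetical choices, not a prescribed policy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Failure(Enum):
    ARTIFACT = "artifact"
    CAPACITY = "capacity"
    TOOL_AMBIGUOUS = "tool_ambiguous"
    VALIDATION = "validation"


@dataclass(frozen=True)
class Attempt:
    failure: Failure
    attempt: int                        # 1-based count of attempts so far
    remaining_ms: int                   # time left before the original deadline
    idempotency_key: Optional[str] = None
    repairable: bool = False


def decide(a: Attempt, max_attempts: int = 3, min_budget_ms: int = 200) -> str:
    """Map a classified failure to an action; thresholds are illustrative."""
    exhausted = a.attempt >= max_attempts or a.remaining_ms < min_budget_ms
    if a.failure is Failure.ARTIFACT:
        return "do_not_retry"           # unchanged input cannot succeed
    if a.failure is Failure.TOOL_AMBIGUOUS:
        if a.idempotency_key is None or exhausted:
            return "suspend_and_reconcile"
        return "retry_same_key"
    if a.failure is Failure.VALIDATION:
        return "bounded_repair" if a.repairable and not exhausted else "human_review"
    return "fail_fast" if exhausted else "retry_with_jitter"   # capacity
```

For example, `decide(Attempt(Failure.TOOL_AMBIGUOUS, 1, 800))` returns `"suspend_and_reconcile"` because no idempotency key is available, whereas a capacity failure with the same remaining budget is retried with jitter.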
Reproduction
Determinism and reproducibility
Determinism means identical controlled inputs produce identical outputs under a stated execution configuration. Reproducibility is broader: another run or environment can reconstruct the relevant behavior and explain differences. Hardware kernels, parallel reduction order, asynchronous scheduling, cache state, dynamic compilation, sampling, retrieval, and external tools can all introduce variation. OpenXLA documents GPU determinism as an explicit configuration and performance concern rather than an automatic property.[18]
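Two of the sources of variation named above can be demonstrated with the standard library alone. Floating-point addition is not associative, so a different reduction order can change a result in its last bits, while sampling becomes repeatable once the generator state is controlled. The values, seed, and `sample` helper are arbitrary illustrations.

```python
import random
from typing import List

values = [0.1, 0.2, 0.3]
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])
order_changes_result = left_to_right != right_to_left  # same inputs, different grouping


def sample(seed: int, n: int = 5) -> List[int]:
    """Draw n pseudo-random token ids from a generator with a fixed seed."""
    rng = random.Random(seed)
    return [rng.randrange(50_000) for _ in range(n)]


same_seed_matches = sample(7) == sample(7)
```

A seed therefore controls the sampler but not the arithmetic order inside parallel kernels, which is one reason a seed alone does not make a run reproducible.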
Artifact identity
Pin model, tokenizer, adapters, preprocessing, quantization, and package digests rather than using mutable names.
Compiler and runtime configuration
Record compiler passes, flags, optimization profiles, execution-provider order, kernel libraries, and runtime versions.
Hardware and driver state
Capture device model, topology, driver, firmware, precision mode, clocks or power policy when relevant, and distributed communication settings.
Scheduling and concurrency
Batch composition, request order, cache state, parallel reductions, and preemption can alter latency and sometimes numerical results.
Generation policy
Record seed, sampler, temperature, top-k or top-p values, stop rules, structured-output constraints, and maximum lengths.
Context and retrieval
Version the query transformation, indexes, filters, ranking logic, source snapshots, redaction rules, and assembled context digest.
Tools and external state
Record tool versions, authorization, arguments, idempotency keys, external resource versions, and results or side effects.
Policy and evaluation
Pin policy, schema, validator, evaluation dataset, rubric, and threshold versions used to accept, reject, or escalate output.
A replayable trace is not necessarily a verbatim replay
For nondeterministic or external systems, replay may mean reconstructing the request, decisions, artifacts, context, tool evidence, and policy state closely enough to evaluate the outcome. Store immutable references and redacted evidence rather than assuming every dependency can be rerun forever.
Longer-lived work
Streaming and asynchronous execution
A synchronous request-response API is not enough for every workload. Streaming needs a terminal status and backpressure. Background tasks need durable identity and authorized status retrieval. Long-running workflows need checkpoints, idempotent activities, version-aware resume, and compensation. Human review needs an explicit pause state rather than an informal message in a prompt.
Incremental streaming
Use when: Interactive generation, progressive results, or long media and document processing.
Runtime requirement: Backpressure, cancellation propagation, partial-output labeling, ordering, stream timeouts, and a clear terminal event.
Background task
Use when: Work that outlives the client connection but has a bounded completion time.
Runtime requirement: Durable task identity, status API, retry policy, output storage, expiration, and caller authorization on resume.
Durable workflow
Use when: Long-running, multistep work with tools, waits, approvals, and compensation.
Runtime requirement: Checkpointed state, deterministic transitions, idempotent activities, version-aware resume, and explicit compensation.
Event-driven continuation
Use when: The next step depends on an external callback, message, file, or scheduled event.
Runtime requirement: Correlation, deduplication, event authenticity, replay protection, dead-letter handling, and timeout escalation.
Human review
Use when: A privileged, ambiguous, high-impact, or policy-sensitive action requires a person.
Runtime requirement: Review packet, least-privilege decision interface, expiry, reviewer identity, approval scope, and resumable state.
Observability
Example runtime trace timeline
The following synthetic trace illustrates how end-to-end latency can be decomposed. It is not a benchmark and does not represent a specific product. What matters is the shared trace identity and the stage-level evidence, not the example durations.
OpenTelemetry’s dedicated GenAI semantic-conventions repository covers conventions for spans, metrics, events, MCP, and provider-specific telemetry.[17] A production schema should extend standard telemetry with the runtime’s identity, policy, context provenance, tool, memory, evaluation, and review requirements.
A twelve-span synthetic trace begins with boundary authorization, overlaps context retrieval with request setup, then records route, queue, prefill, decode, tool authorization and execution, final generation, validation, trace and memory commit, and response packaging.
- Boundary and authorization (API gateway / runtime, 0–24 ms): Accepted; tenant and authority attached.
- Context retrieval (Context service, 18–156 ms): Six approved sources; one document redacted.
- Route and admission (Router / scheduler, 150–166 ms): Primary local route; 1.5 s deadline retained.
- Queue (Inference scheduler, 160–234 ms): Joined compatible batch; cache reservation successful.
- Prefill (Inference engine, 230–516 ms): Prompt processed; first token available.
- Decode and stream (Inference engine, 510–936 ms): Structured tool request and partial response emitted.
- Tool authorization (Tool broker / policy, 702–724 ms): Read-only CRM lookup allowed.
- Tool execution (CRM adapter, 720–884 ms): Typed result returned; no side effect.
- Final model continuation (Inference engine, 882–1136 ms): Answer completed with citations.
- Validation and evaluation (Trust plane, 1128–1220 ms): Schema valid; evidence and policy checks passed.
- Trace and memory commit (Telemetry / memory, 1210–1258 ms): Trace committed; no long-term memory change.
- Response package (Runtime, 1254–1272 ms): Completed in 1.272 seconds.
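Because the spans in the timeline overlap, adding their durations overstates wall-clock latency. The sketch below uses the twelve synthetic start and end offsets from the timeline to compute end-to-end time, summed span time, overlap, and any untraced gap. The span names and the `summarize` helper are illustrative assumptions.

```python
# (name, start_ms, end_ms) copied from the synthetic timeline above.
SPANS = [
    ("boundary_authorization", 0, 24),
    ("context_retrieval", 18, 156),
    ("route_admission", 150, 166),
    ("queue", 160, 234),
    ("prefill", 230, 516),
    ("decode_stream", 510, 936),
    ("tool_authorization", 702, 724),
    ("tool_execution", 720, 884),
    ("final_continuation", 882, 1136),
    ("validation_evaluation", 1128, 1220),
    ("trace_memory_commit", 1210, 1258),
    ("response_package", 1254, 1272),
]

def summarize(spans):
    """Decompose latency: wall-clock span, summed durations, overlap, and gaps."""
    end_to_end = max(end for _, _, end in spans) - min(start for _, start, _ in spans)
    summed = sum(end - start for _, start, end in spans)
    # Merge overlapping intervals to find time covered by at least one span.
    covered = 0
    cur_start = cur_end = None
    for _, start, end in sorted(spans, key=lambda span: span[1]):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return {
        "end_to_end_ms": end_to_end,
        "summed_span_ms": summed,
        "overlap_ms": summed - covered,
        "untraced_gap_ms": end_to_end - covered,
    }

print(summarize(SPANS))
```

For this trace the summed span time is 1562 ms against 1272 ms end to end, so 290 ms of work ran concurrently and no interval is left untraced.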
Conceptual trace event envelope
{
  "traceId": "trc_01JYEXAMPLE",
  "spanId": "spn_tool_07",
  "parentSpanId": "spn_runtime_01",
  "eventType": "tool.completed",
  "timestampUtc": "2026-06-21T18:42:16.428Z",
  "component": "crm-read-adapter",
  "attempt": 1,
  "requestRef": "req_01JYEXAMPLE",
  "actorRef": "actor_redacted",
  "modelRoute": "local-primary",
  "tool": { "name": "crm.lookup", "version": "2.4.1" },
  "durationMs": 164,
  "policyDecision": "allow",
  "sideEffect": "none",
  "redactionStatus": "payload-redacted",
  "resultRef": "evidence://sha256/example"
}
Synthetic example. Identifiers and evidence references are illustrative.
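One way to produce an envelope like this without embedding the tool payload is to store the payload elsewhere and reference it by content digest. The sketch below assumes the `evidence://sha256/` prefix from the example is followed by a hex SHA-256 of a canonical JSON encoding; that interpretation, and the helper names, are assumptions rather than a defined scheme.

```python
import hashlib
import json

def evidence_ref(payload):
    """Content-address a payload; key order does not change the digest."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "evidence://sha256/" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def tool_completed_event(trace_id, span_id, parent_span_id, tool, duration_ms, payload):
    """Build a trace event that references the payload instead of embedding it."""
    return {
        "traceId": trace_id,
        "spanId": span_id,
        "parentSpanId": parent_span_id,
        "eventType": "tool.completed",
        "tool": tool,
        "durationMs": duration_ms,
        "redactionStatus": "payload-redacted",
        "resultRef": evidence_ref(payload),  # hypothetical reference format
    }

event = tool_completed_event(
    "trc_example", "spn_tool_07", "spn_runtime_01",
    {"name": "crm.lookup", "version": "2.4.1"}, 164,
    {"customer": "c-9", "tier": "gold"},
)
```

A bare digest of low-entropy data can be guessed by brute force, so a keyed hash or an opaque storage identifier may be a better fit for sensitive payloads.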
Implementation guidance
End-to-end runtime design checklist
Model preparation
- Pin the source model, tokenizer, preprocessing, adapters, exporter, compiler, and runtime versions.
- Validate graph semantics, shapes, types, partitioning, numerical parity, and fallback behavior.
- Record artifact digests, target hardware, optimization flags, memory plan, and compatibility bounds.
- Separate build-time failures from deployment-readiness failures and live request failures.
Request contract
- Use a versioned request envelope with actor, tenant, task, permissions, deadline, budget, context policy, model constraints, tools, memory, output contract, and trace settings.
- Reject invalid or unauthorized requests before expensive retrieval or model execution.
- Carry request, trace, and idempotency identifiers across every boundary.
- Define cancellation, retry, fallback, and partial-response semantics before production traffic.
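The request-contract items above can be sketched as a versioned envelope with a cheap precheck that runs before retrieval or model execution. The field subset, the `tool:<name>` permission format, and the reason strings are hypothetical choices made for illustration.

```python
from dataclasses import dataclass

SUPPORTED_VERSIONS = {"1"}  # hypothetical envelope versions this runtime accepts

@dataclass(frozen=True)
class RequestEnvelope:
    version: str
    request_id: str
    trace_id: str
    idempotency_key: str
    actor: str
    tenant: str
    task: str
    permissions: frozenset
    deadline_ms: int
    requested_tools: tuple = ()

def precheck(env):
    """Return rejection reasons; an empty list means the request may proceed."""
    problems = []
    if env.version not in SUPPORTED_VERSIONS:
        problems.append("unsupported_version")
    for name in ("request_id", "trace_id", "idempotency_key", "actor", "tenant"):
        if not getattr(env, name):
            problems.append(f"missing_{name}")
    if env.deadline_ms <= 0:
        problems.append("deadline_not_positive")
    for tool in env.requested_tools:
        if f"tool:{tool}" not in env.permissions:
            problems.append(f"tool_not_permitted:{tool}")
    return problems
```

Returning every reason rather than stopping at the first makes rejected requests easier to diagnose, at the cost of revealing more about the contract to the caller.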
Scheduling and capacity
- Measure queue time separately from model time, tool time, validation time, and end-to-end latency.
- Make batching, cache reuse, priority, fairness, admission, preemption, and backpressure policies explicit.
- Reserve enough capacity for cancellation cleanup, health checks, and degraded-mode operation.
- Load-test realistic input and output lengths, concurrency, cache state, and tool latency distributions.
Tools, policy, and memory
- Enforce tool authorization, schemas, credentials, idempotency, egress, timeouts, rate limits, and approval outside the prompt.
- Distinguish read operations, reversible writes, irreversible writes, and actions that affect money, identity, access, or regulated records.
- Make memory writes explicit, provenance-aware, reviewable, and deletable.
- Fail closed when mandatory audit or policy evidence cannot be recorded.
Observability and recovery
- Emit correlated spans and metrics for boundary, context, route, queue, model, tools, policy, evaluation, memory, and response packaging.
- Redact secrets, personal data, sensitive prompts, and tool payloads according to a documented telemetry policy.
- Classify errors by stage and retry safety; never use a generic retry loop for ambiguous side effects.
- Maintain replay fixtures, failure-injection tests, known-good fallbacks, and incident procedures for each external dependency.
FAQ
Frequently asked questions
Does every runtime compile a model before execution?
No. Some runtimes interpret or dispatch a graph, some compile ahead of time, some compile or specialize lazily, and many combine compiled subgraphs with interpreted or library-backed fallbacks. The important questions are what is prepared before deployment, what may happen on the first request, and which assumptions trigger recompilation or fallback.
Why separate model preparation from request execution?
They have different inputs, owners, failure modes, and controls. Model preparation concerns compatibility, optimization, code generation, packaging, and readiness. Request execution concerns identity, context, scheduling, policy, tools, telemetry, and user-visible service objectives. Treating them as a single stage obscures where a failure occurred and makes reproduction harder.
What happens during LLM prefill and decode?
Prefill processes the input sequence and establishes attention cache state before or while producing the first output token. Decode then advances generation incrementally while reusing prior state. Runtime schedulers balance prompt work, active generation, cache capacity, latency, and throughput.
Where should tool authorization occur?
In deterministic runtime or infrastructure controls between the model proposal and execution. The runtime should validate the tool and argument schema, identify the actor and delegated authority, enforce permissions and egress rules, obtain required approval, and record the decision before the adapter executes.
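A minimal sketch of that control point, assuming an in-memory tool registry and a list used as the audit log. The tool `crm.lookup` mirrors the trace example; `crm.update`, the permission strings, and the reason codes are invented for illustration.

```python
# Hypothetical registry: argument types, required permission, side-effect class.
TOOLS = {
    "crm.lookup": {
        "args": {"customer_id": str},
        "permission": "crm:read",
        "side_effect": "none",
    },
    "crm.update": {
        "args": {"customer_id": str, "email": str},
        "permission": "crm:write",
        "side_effect": "reversible_write",
    },
}

def authorize(tool, args, permissions, approved, audit_log):
    """Return 'allow' or 'deny' and record the decision before any adapter runs."""
    spec = TOOLS.get(tool)
    if spec is None:
        decision, reason = "deny", "unknown_tool"
    elif set(args) != set(spec["args"]) or not all(
        isinstance(args[name], kind) for name, kind in spec["args"].items()
    ):
        decision, reason = "deny", "invalid_arguments"
    elif spec["permission"] not in permissions:
        decision, reason = "deny", "missing_permission"
    elif spec["side_effect"] != "none" and not approved:
        decision, reason = "deny", "approval_required"
    else:
        decision, reason = "allow", "policy_passed"
    # If this append raises, no decision is returned, so the caller cannot
    # proceed without an audit record.
    audit_log.append({"tool": tool, "decision": decision, "reason": reason})
    return decision
```

The model's proposal is only an input to `authorize`; nothing in the prompt can change the registry, the permission set, or the approval flag.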
Can a runtime safely retry any failed request?
No. Retries are safe only when the failure is classified and the operation is idempotent or can be reconciled. A model-provider timeout may be retryable; a tool call that may have charged a card or changed access requires an idempotency key or an external state check before any retry.
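The classification can be sketched as a small decision function. The failure kinds, the returned action names, and the attempt limit below are assumptions chosen for illustration; a real runtime would derive them from its own error taxonomy.

```python
from dataclasses import dataclass

DETERMINISTIC = {"schema_violation", "unauthorized", "invalid_request"}
TRANSIENT = {"timeout", "unavailable", "rate_limited"}

@dataclass(frozen=True)
class Failure:
    stage: str                 # for example "queue", "model", or "tool"
    kind: str
    side_effecting: bool = False
    idempotency_key: str = ""

def next_action(failure, attempt, max_attempts=3):
    """Choose between retry, reconcile, escalate, and fail for one failure."""
    if failure.kind in DETERMINISTIC:
        return "fail"        # the same input would fail the same way again
    if failure.kind not in TRANSIENT:
        return "escalate"    # unclassified failures are not retried blindly
    if failure.side_effecting and not failure.idempotency_key:
        return "reconcile"   # check external state before doing anything else
    if attempt >= max_attempts:
        return "fail"
    return "retry"
```

For example, `next_action(Failure("tool", "timeout", side_effecting=True), attempt=1)` yields `"reconcile"`, while the same failure carrying an idempotency key yields `"retry"`, on the assumption that the receiving system deduplicates on that key.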
Is a random seed enough for reproducibility?
No. Seeds do not control mutable model aliases, compiler versions, kernels, distributed reduction order, batch composition, retrieval results, cache state, external tools, policy versions, or hardware behavior. Reproduction requires a complete execution manifest and trace.
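To make the point concrete, the sketch below compares two hypothetical execution manifests that share a seed. The field names and placeholder values are assumptions; the idea is only that a manifest records each controlled input so a mismatch can be named.

```python
def manifest_diff(expected, observed):
    """Return the manifest fields whose values differ between two runs."""
    keys = expected.keys() | observed.keys()
    return sorted(key for key in keys if expected.get(key) != observed.get(key))

# Hypothetical manifests: identical seed, different resolved artifact and kernels.
baseline = {
    "seed": 1234,
    "model_artifact": "sha256:aaaa",
    "compiler_options": "opt-level-2",
    "kernel_selection": "kernel-set-A",
    "hardware": "accelerator-type-X",
    "policy_version": "policy-7",
}
rerun = {**baseline, "model_artifact": "sha256:bbbb", "kernel_selection": "kernel-set-B"}

print(manifest_diff(baseline, rerun))  # fields that explain a divergent result
```

Here the seed matches, yet the rerun resolved a different model artifact and kernel set, either of which could change the output.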
What should a runtime expose in a response?
At minimum: request identifier and status, structured result, warnings, evidence or citations, tool outcomes when appropriate, model-route summary, policy and review state, memory changes, trace identifier, and timing or cost summary. Sensitive internal reasoning and secrets should not be exposed.
How should long-running work resume after a failure?
Resume from a durable checkpoint whose schema and workflow version are known. Revalidate authority and deadlines, reconcile completed side effects, use idempotent activities, and record why the workflow resumed, retried, compensated, or required human intervention.
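A minimal sketch of that resume logic, assuming a checkpoint stored as a dictionary, one idempotency key per workflow step, and a set of keys the external systems report as already applied. All names here are hypothetical, and revalidating authority and deadlines is left out for brevity.

```python
def resume(checkpoint, steps, supported_versions, applied_effects, run_step, reason):
    """Continue a workflow from its checkpoint without repeating side effects."""
    if checkpoint["workflow_version"] not in supported_versions:
        raise ValueError("unsupported workflow version; migrate or send to human review")
    log = list(checkpoint["log"]) + [("resumed", reason)]
    checkpoint = {**checkpoint, "log": list(log)}
    for index in range(checkpoint["next_step"], len(steps)):
        step = steps[index]
        key = f'{checkpoint["workflow_id"]}:{step}'  # stable idempotency key
        if key in applied_effects:
            # The effect landed before the crash; record it instead of re-running.
            log.append(("reconciled", step))
        else:
            run_step(step, key)
            applied_effects.add(key)
            log.append(("executed", step))
        # A durable implementation would persist this checkpoint after each step.
        checkpoint = {**checkpoint, "next_step": index + 1, "log": list(log)}
    return checkpoint
```

If a payment step completed but the process crashed before checkpointing, `applied_effects` already contains its key, so the step is logged as reconciled and only the remaining steps execute.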
Sources and further reading
This page uses official project documentation and specifications to describe execution responsibilities. Product examples illustrate boundaries and are not endorsements or benchmark rankings.
- XLA architecture. OpenXLA project; Official project documentation. Compiler objectives and high-level path from HLO through optimization to target-specific execution. Accessed 2026-06-21 UTC.
- StableHLO Specification. OpenXLA project; Official specification. Portable high-level operation semantics between model frameworks and ML compilers. Accessed 2026-06-21 UTC.
- ONNX Runtime Architecture. ONNX Runtime project; Official project documentation. Graph conversion, provider-independent optimization, execution-provider capability discovery, partitioning, compilation, and execution. Accessed 2026-06-21 UTC.
- Architecture and Components. PyTorch / ExecuTorch project; Official project documentation. Export, lowering, ahead-of-time preparation, runtime packaging, loading, and execution for constrained devices. Accessed 2026-06-21 UTC.
- Memory Planning. PyTorch / ExecuTorch project; Official project documentation. Tensor lifetime and size analysis for placement into fixed-size memory arenas. Accessed 2026-06-21 UTC.
- IREE Extensions and Architecture Guidelines. IREE project; Official project documentation. Compiler/runtime separation and host-code versus device-code execution boundaries. Accessed 2026-06-21 UTC.
- vLLM Documentation. vLLM project; Official project documentation. Paged attention memory management, continuous batching, prefix caching, streaming, quantization, and distributed inference capabilities. Accessed 2026-06-21 UTC.
- Optimization and Tuning: Chunked Prefill. vLLM project; Official project documentation. Scheduling trade-offs between prefill and decode, chunking, throughput, time to first token, and inter-token latency. Accessed 2026-06-21 UTC.
- Triton Architecture. NVIDIA; Official project documentation. Model repositories, HTTP and gRPC requests, per-model schedulers, batching, backends, model management, health, and metrics. Accessed 2026-06-21 UTC.
- Concurrent Model Execution and Schedulers. NVIDIA; Official project documentation. Model instances, concurrency, stateless and stateful models, dynamic batching, sequence scheduling, and correlation. Accessed 2026-06-21 UTC.
- Triton Metrics. NVIDIA; Official project documentation. Request, inference, execution, failure, pending-request, latency, and batch-related metrics. Accessed 2026-06-21 UTC.
- OpenAPI Specification 3.2.0. OpenAPI Initiative; Standards specification. Language-neutral interface descriptions for discovering and invoking HTTP APIs. Accessed 2026-06-21 UTC.
- JSON Schema Validation Vocabulary. JSON Schema project; Standards specification. Structural assertions, annotations, and validation vocabulary for JSON instance data. Accessed 2026-06-21 UTC.
- Model Context Protocol Specification 2025-11-25. Model Context Protocol project; Official protocol specification. Protocol roles and requirements for connecting LLM applications to external data sources and tools. Accessed 2026-06-21 UTC.
- Model Context Protocol Tools. Model Context Protocol project; Official protocol specification. Tool discovery and invocation metadata, including names and input schemas. Accessed 2026-06-21 UTC.
- Model Context Protocol Authorization. Model Context Protocol project; Official protocol specification. HTTP transport authorization flow for clients acting on behalf of resource owners. Accessed 2026-06-21 UTC.
- OpenTelemetry GenAI Semantic Conventions. OpenTelemetry project; Official project repository. Generative-AI spans, metrics, events, MCP conventions, and provider-specific telemetry conventions. Accessed 2026-06-21 UTC.
- Determinism (GPU). OpenXLA project; Official project documentation. GPU determinism considerations and the distinction between deterministic execution and performance trade-offs. Accessed 2026-06-21 UTC.
Last reviewed: 2026-06-21 UTC
