The aRuntime glossary defines runtime terms across compiler pipelines, graph execution, LLM serving, edge/browser deployment, agentic systems, governance, and benchmarking.
Agentic runtime
Agentic runtime
Definition: A production execution layer for agentic work that coordinates context, tools, memory, policy, state, and traceability.
In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Tool broker
Definition: An agentic runtime concept used when designing, implementing, or operating AI runtimes.
Tool contract
Definition: An agentic runtime concept used when designing, implementing, or operating AI runtimes.
Side-effect classification
Definition: An agentic runtime concept used when designing, implementing, or operating AI runtimes.
Idempotency
Definition: The property that repeating an operation with the same inputs has the same effect as performing it once, which makes retried tool calls safe.
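One common way to make a retried tool call safe is to store its result under a caller-supplied idempotency key and return the stored result on any repeat. The sketch below is illustrative only: `IdempotentExecutor`, `send_email`, and the key format are hypothetical names, not part of any aRuntime API, and a production version would persist keys outside process memory.

```python
from typing import Any, Callable, Dict


class IdempotentExecutor:
    """Runs a side-effecting call at most once per idempotency key (hypothetical helper)."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def run(self, key: str, call: Callable[[], Any]) -> Any:
        # A repeated key returns the stored result instead of repeating the side effect.
        if key not in self._results:
            self._results[key] = call()
        return self._results[key]


sent = []  # stands in for an external side effect


def send_email() -> str:
    sent.append("welcome")
    return f"message-{len(sent)}"


executor = IdempotentExecutor()
first = executor.run("order-42:welcome", send_email)
retry = executor.run("order-42:welcome", send_email)  # e.g. a retry after a timeout
```

Both calls return the same value and the side effect happens once, because the second call finds the key already recorded.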
Memory manager
Definition: An agentic runtime concept used when designing, implementing, or operating AI runtimes.
Working memory
Definition: An agentic runtime concept used when designing, implementing, or operating AI runtimes.
Long-term memory
Definition: An agentic runtime concept used when designing, implementing, or operating AI runtimes.
Context assembly
Definition: An agentic runtime concept used when designing, implementing, or operating AI runtimes.
Human review
Definition: An agentic runtime concept used when designing, implementing, or operating AI runtimes.
Rollback
Definition: Restoring state to a previously known-good point after a failed or rejected action.
Compensation
Definition: A follow-up action that semantically undoes the effects of a completed step that cannot be rolled back directly, as in saga-style workflows.
Evaluation envelope
Definition: An agentic runtime concept used when designing, implementing, or operating AI runtimes.
Runtime contract
Definition: An agentic runtime concept used when designing, implementing, or operating AI runtimes.
Compiler and IR terms
Intermediate representation
Definition: A data structure or language that a compiler uses between the source program and the target code, and on which analyses and transformations run.
ONNX
Definition: An open model graph format used to exchange models between frameworks and inference runtimes.
StableHLO
Definition: A portable high-level operation set used in compiler workflows around OpenXLA-compatible systems.
MLIR
Definition: A compiler infrastructure in the LLVM project for defining and transforming multiple levels of intermediate representation, organized as dialects.
TOSA
Definition: Tensor Operator Set Architecture, a specification of whole-tensor operators commonly used in neural networks, intended as a stable target across hardware.
HLO
Definition: High Level Operations, the intermediate representation used by the XLA compiler.
PTX
Definition: Parallel Thread Execution, NVIDIA's virtual instruction set for CUDA GPU code, which the driver compiles to device-specific machine code.
SPIR-V
Definition: A binary intermediate language from the Khronos Group for graphics shaders and compute kernels, consumed by APIs such as Vulkan and OpenCL.
Tracing
Definition: Capturing a model's computation graph by running it on example inputs and recording the operations that execute.
Scripting
Definition: Capturing a model by compiling its source code directly, as in TorchScript, which preserves data-dependent control flow that tracing would miss.
JIT compilation
Definition: Compiling code at run time, just before or during execution, often specialized to the inputs actually observed.
AOT compilation
Definition: Compiling a model or program ahead of time, before deployment, so that no compiler runs on the serving path.
Lowering
Definition: Translating a higher-level representation into a lower-level one that is closer to the target hardware.
Code generation
Definition: The compiler stage that emits target code, such as machine instructions or GPU kernels, from a lowered representation.
Shape guard
Definition: A runtime check that input shapes match the assumptions under which a compiled artifact was specialized, triggering recompilation or fallback when they do not.
Dynamic shape
Definition: A tensor dimension whose size is unknown until run time, such as batch size or sequence length.
Graph partitioning
Definition: Splitting a model graph into subgraphs that are assigned to different backends or devices.
Execution provider
Definition: A backend abstraction used to run supported graph partitions on specific hardware or libraries.
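As a rough illustration of how a runtime might assign graph partitions to execution providers, the sketch below walks a linear list of operators and groups consecutive operators that land on the same provider. The provider names, the supported-operator sets, and `partition` are assumptions made for illustration; real runtimes partition graphs rather than lists and use richer capability checks.

```python
from typing import List, Set, Tuple

# Hypothetical providers in priority order, with the operators each claims to support.
PROVIDERS: List[Tuple[str, Set[str]]] = [
    ("accelerator", {"Conv", "Relu", "MatMul"}),
    ("cpu", {"Conv", "Relu", "MatMul", "Softmax", "NonMaxSuppression"}),
]


def partition(ops: List[str]) -> List[Tuple[str, List[str]]]:
    """Assign each op to the first provider that supports it, merging adjacent ops."""
    parts: List[Tuple[str, List[str]]] = []
    for op in ops:
        # Raises StopIteration if no provider supports the op; the fallback
        # provider in this sketch is assumed to cover every operator.
        provider = next(name for name, supported in PROVIDERS if op in supported)
        if parts and parts[-1][0] == provider:
            parts[-1][1].append(op)
        else:
            parts.append((provider, [op]))
    return parts


plan = partition(["Conv", "Relu", "NonMaxSuppression", "MatMul", "Softmax"])
```

In this example the unsupported operators force the graph into four partitions, which is why operator coverage of a provider matters as much as its raw speed: every boundary between partitions is a potential data transfer.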
Delegate
Definition: A mobile or edge backend that accelerates supported operations on a target processor.
BYOC
Definition: Bring Your Own Codegen, an Apache TVM mechanism for offloading supported subgraphs to an external compiler or library.
Deployment
Model repository
Definition: A storage location, typically with a defined layout, from which a serving system loads models and their configurations.
Model versioning
Definition: Tracking distinct, identifiable versions of a model so that each can be deployed, compared, and rolled back.
Canary rollout
Definition: Releasing a new version to a small share of traffic first and expanding that share only if its metrics stay healthy.
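A canary rollout sends a small, controlled share of traffic to the new version. One simple way to do that, sketched below under assumed names (`route` and the `req-…` identifiers are hypothetical), is to hash a stable request or user identifier into one of 100 buckets so that the same identifier always lands on the same version.

```python
import hashlib


def route(request_id: str, canary_percent: int) -> str:
    """Return "canary" for roughly canary_percent of request IDs, else "stable"."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in [0, 100)
    return "canary" if bucket < canary_percent else "stable"


# Count how a batch of synthetic request IDs splits at a 5% canary share.
counts = {"canary": 0, "stable": 0}
for i in range(10_000):
    counts[route(f"req-{i}", canary_percent=5)] += 1
```

Hashing rather than random sampling keeps the assignment sticky, so raising `canary_percent` only moves additional identifiers onto the canary and never moves existing ones back.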
Blue-green deployment
Definition: Running two complete environments side by side and switching traffic from the current one to the new one in a single cutover.
Autoscaling
Definition: Automatically adjusting the number of serving replicas in response to load or other metrics.
Cold start
Definition: The added latency of a first request served by a fresh instance that must still load the model and initialize the runtime.
Warmup
Definition: Running representative requests before serving traffic so that caches, lazily compiled kernels, and memory pools are initialized.
Scale to zero
Definition: Removing all replicas when there is no traffic, at the cost of a cold start on the next request.
Kubernetes
Definition: An open-source container orchestration system commonly used to schedule and scale model-serving workloads.
Serverless runtime
Definition: A platform that runs code on demand and manages provisioning and scaling on the user's behalf.
MicroVM
Definition: A lightweight virtual machine with a minimal device model, designed for fast startup and strong isolation between workloads.
Private cloud runtime
Definition: A runtime operated on infrastructure dedicated to a single organization.
Managed cloud runtime
Definition: A runtime operated by a cloud provider, which handles provisioning, patching, and scaling.
Air-gapped runtime
Definition: A runtime deployed in an environment with no network connection to external systems.
Hybrid runtime
Definition: A runtime that spans more than one environment, such as on-premises and cloud, or device and server.
Local runtime
Definition: A runtime that executes on the user's own machine rather than on a remote service.
Distributed inference
Tensor parallelism
Definition: Splitting individual weight tensors across devices so that each device computes part of every layer.
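Tensor parallelism splits a single layer's weights across devices. The sketch below simulates the column-parallel case in plain Python: each "device" multiplies the same activations by its own column shard, and concatenating the partial outputs (the role an all-gather plays on real hardware) reproduces the unsharded result. The helper names and the tiny matrices are assumptions for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split a weight matrix into `parts` column shards (columns must divide evenly)."""
    width = len(w[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(parts)]


x = [[1.0, 2.0]]                      # activations, replicated on every device
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]            # full weight matrix

# Each "device" multiplies by its own shard; results are concatenated per row.
shards = split_columns(w, parts=2)
partials = [matmul(x, shard) for shard in shards]
y = [sum((p[r] for p in partials), []) for r in range(len(x))]
```

Each shard holds half of the weights, so per-device memory drops with the number of devices, at the price of a communication step after the layer.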
Pipeline parallelism
Definition: Assigning consecutive groups of layers to different devices and passing activations between the stages.
Data parallelism
Definition: Replicating the full model on several devices and sending different requests or batches to each replica.
Expert parallelism
Definition: Placing the experts of a mixture-of-experts model on different devices and routing tokens to the devices that hold their selected experts.
Sequence parallelism
Definition: Splitting activations along the sequence dimension across devices so that each device holds part of the tokens.
Disaggregated serving
Definition: Running the prefill and decode phases of LLM inference on separate workers so that each can be scaled and tuned independently.
Prefill worker
Definition: A worker that processes the input prompt and produces the initial KV cache.
Decode worker
Definition: A worker that generates output tokens one step at a time using the KV cache.
Collective communication
Definition: Communication operations that involve a group of devices, such as all-reduce, all-gather, and broadcast.
Interconnect
Definition: The physical links that carry data between devices or nodes, such as NVLink, PCIe, InfiniBand, or Ethernet.
Elasticity
Definition: The ability to add or remove workers while the system continues to serve.
Rebalancing
Definition: Redistributing work or state across workers after load or cluster membership changes.
Edge/mobile/browser
Edge runtime
Definition: A runtime that executes models on hardware close to where data is produced rather than in a central data center.
On-device runtime
Definition: A runtime that executes models directly on the end user's device.
TinyML
Definition: Machine learning on microcontrollers and similarly constrained devices with very small memory and power budgets.
Mobile delegate
Definition: A delegate that targets a mobile accelerator such as a GPU, DSP, or NPU.
Browser runtime
Definition: A runtime that executes models inside a web browser, typically through WebAssembly, WebGPU, or WebNN.
WebAssembly
Definition: A portable binary instruction format that runs in browsers and other hosts at near-native speed.
WebGPU
Definition: A web API exposing GPU compute capabilities to browser applications.
WebNN
Definition: A web API for constructing and executing neural network graphs using operating system and hardware capabilities.
Worker
Definition: An edge/mobile/browser concept used when designing, implementing, or operating AI runtimes.
IndexedDB model cache
Definition: Model files stored in the browser's IndexedDB so that later visits can load them without downloading again.
Progressive enhancement
Definition: Delivering a baseline experience that works everywhere and enabling faster or richer paths, such as WebGPU acceleration, only where the client supports them.
NPU delegate
Definition: A delegate that offloads supported operations to a neural processing unit.
Thermal throttling
Definition: A reduction in clock speed that a device applies to limit heat, which lowers sustained inference throughput.
Offline inference
Definition: Inference that runs on the device without a network connection.
Graph optimization
Constant folding
Definition: Evaluating subgraphs whose inputs are all constants at compile time and replacing them with the result.
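Constant folding evaluates, at compile time, any part of the graph whose inputs are all constants. A minimal sketch on a toy expression tree follows; the node classes and `fold` are hypothetical and stand in for a real graph IR.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Mul:
    left: "Expr"
    right: "Expr"


Expr = Union[Const, Var, Add, Mul]


def fold(expr: Expr) -> Expr:
    """Replace subtrees whose operands are all constants with a single Const."""
    if isinstance(expr, (Add, Mul)):
        left, right = fold(expr.left), fold(expr.right)
        if isinstance(left, Const) and isinstance(right, Const):
            if isinstance(expr, Add):
                return Const(left.value + right.value)
            return Const(left.value * right.value)
        return type(expr)(left, right)  # rebuild the node with folded children
    return expr


# x * (2 + 3) folds to x * 5; the variable keeps the multiply from folding.
folded = fold(Mul(Var("x"), Add(Const(2.0), Const(3.0))))
```

The pass works bottom-up, so a constant subtree nested anywhere inside a non-constant expression is still collapsed.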
Dead-code elimination
Definition: Removing operations whose results are never used.
Common-subexpression elimination
Definition: Computing an expression that appears more than once a single time and reusing the result.
Algebraic simplification
Definition: Rewriting expressions with algebraic identities, such as replacing x * 1 with x.
Operator fusion
Definition: Combining adjacent graph operators into one so that intermediate tensors need not be written to memory.
Kernel fusion
Definition: Merging several device kernels into a single kernel to cut launch overhead and memory traffic.
Epilogue fusion
Definition: Fusing elementwise work such as bias addition or activation into the tail of a matrix multiplication or convolution kernel.
Loop tiling
Definition: Splitting loops into blocks sized so that the working set fits in cache or on-chip memory.
Loop unrolling
Definition: Replicating a loop body several times per iteration to reduce loop overhead and expose instruction-level parallelism.
Vectorization
Definition: Rewriting scalar operations to use SIMD instructions that process several elements at once.
Memory layout transformation
Definition: Changing how a tensor is arranged in memory, for example from NCHW to NHWC, to suit the target kernels.
Buffer reuse
Definition: Sharing one memory buffer among tensors whose lifetimes do not overlap.
Memory planning
Definition: Deciding ahead of execution where each tensor will live in memory, usually to minimize peak usage.
Autotuning
Definition: Searching over implementation parameters by measuring candidates on the target hardware and keeping the fastest.
Schedule search
Definition: Searching the space of loop schedules, such as tilings, orderings, and parallelization choices, for a fast implementation of an operator.
Quantization
Definition: Representing weights or activations with lower-precision numbers, such as 8-bit integers, to reduce memory and compute.
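Quantization stores values at lower precision and maps them back with a scale and an offset. The sketch below shows one common scheme, per-tensor affine quantization to int8, in plain Python; the function names and sample weights are assumptions for illustration, and production toolchains add calibration and per-channel scales.

```python
from typing import List, Tuple


def quantize(values: List[float]) -> Tuple[List[int], float, int]:
    """Map floats to int8 with an affine scale and zero point."""
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)  # range must include 0
    scale = (hi - lo) / 255.0 or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(-128 - lo / scale)
    # Round to the nearest step, shift by the zero point, and clamp to int8.
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q: List[int], scale: float, zero_point: int) -> List[float]:
    return [(x - zero_point) * scale for x in q]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
```

Each restored value differs from the original by at most about one quantization step (`scale`), and zero maps back to exactly zero, which matters for padding and sparse weights.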
Post-training quantization
Definition: Quantizing an already trained model, often with a small calibration dataset and no retraining.
Quantization-aware training
Definition: Training or fine-tuning with simulated quantization so that the model learns to tolerate the reduced precision.
Mixed precision
Definition: Using more than one numeric precision in the same model, for example 16-bit multiplication with 32-bit accumulation.
Sparsity
Definition: The presence of zero values in weights or activations, which suitable kernels and hardware can skip.
Pruning
Definition: Removing weights, channels, or other structures that contribute little to the output.
Palettization
Definition: Weight compression that replaces each weight with an index into a small lookup table (palette) of shared values.
Weight compression that replaces each weight with an index into a small lookup table (palette) of shared values. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Hardware and kernels
CPU
Definition: A general-purpose processor that hosts runtime orchestration and can also serve as an inference target, especially for small models or when no accelerator is available.
A general-purpose processor that hosts runtime orchestration and can also serve as an inference target, especially for small models or when no accelerator is available. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
GPU
Definition: A massively parallel processor with high memory bandwidth, widely used as an accelerator for both training and inference.
A massively parallel processor with high memory bandwidth, widely used as an accelerator for both training and inference. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
TPU
Definition: Tensor Processing Unit, Google's family of machine-learning accelerators built around matrix-multiply units.
Tensor Processing Unit, Google's family of machine-learning accelerators built around matrix-multiply units. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
NPU
Definition: Neural processing unit, a dedicated neural-network accelerator commonly integrated into mobile, laptop, and edge systems-on-chip.
Neural processing unit, a dedicated neural-network accelerator commonly integrated into mobile, laptop, and edge systems-on-chip. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
DSP
Definition: Digital signal processor, a processor specialized for streaming numeric workloads that some edge runtimes target for low-power inference.
Digital signal processor, a processor specialized for streaming numeric workloads that some edge runtimes target for low-power inference. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
FPGA
Definition: Field-programmable gate array, reconfigurable logic that can implement custom inference datapaths.
Field-programmable gate array, reconfigurable logic that can implement custom inference datapaths. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Custom accelerator
Definition: A purpose-built chip designed for a specific class of AI workloads rather than general-purpose computing.
A purpose-built chip designed for a specific class of AI workloads rather than general-purpose computing. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
SIMD
Definition: Single instruction, multiple data: one instruction operating on a vector of data elements at once.
Single instruction, multiple data: one instruction operating on a vector of data elements at once. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
AVX
Definition: Advanced Vector Extensions, the x86 SIMD instruction set family that includes AVX, AVX2, and AVX-512.
Advanced Vector Extensions, the x86 SIMD instruction set family that includes AVX, AVX2, and AVX-512. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
NEON
Definition: Arm's Advanced SIMD instruction set extension, used for vectorized kernels on Arm CPUs.
Arm's Advanced SIMD instruction set extension, used for vectorized kernels on Arm CPUs. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Tensor core
Definition: A specialized unit on NVIDIA GPUs that performs matrix multiply-accumulate on small tiles, typically in reduced precision.
A specialized unit on NVIDIA GPUs that performs matrix multiply-accumulate on small tiles, typically in reduced precision. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Shared memory
Definition: On GPUs, fast on-chip memory that is managed by software and shared by the threads of one thread block.
On GPUs, fast on-chip memory that is managed by software and shared by the threads of one thread block. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
HBM
Definition: High Bandwidth Memory, stacked DRAM placed on the accelerator package to provide very high memory bandwidth.
High Bandwidth Memory, stacked DRAM placed on the accelerator package to provide very high memory bandwidth. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Memory bandwidth
Definition: The rate at which data can move between memory and compute units, often the limiting factor for LLM decode.
The rate at which data can move between memory and compute units, often the limiting factor for LLM decode. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Arithmetic intensity
Definition: The ratio of arithmetic operations performed to bytes moved from memory, used in the roofline model to tell whether a kernel is compute-bound or memory-bound.
The ratio of arithmetic operations performed to bytes moved from memory, used in the roofline model to tell whether a kernel is compute-bound or memory-bound. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
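To make arithmetic intensity concrete, the sketch below estimates it for a dense matrix multiplication and applies the roofline bound. The traffic model (each operand read once, the result written once) and the peak figures used in the usage note are simplifying assumptions for illustration, not measurements of any real device.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved to or from memory."""
    return flops / bytes_moved


def matmul_intensity(m, n, k, bytes_per_element=2):
    """Intensity of an (m x k) @ (k x n) matmul under an idealized traffic model."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return arithmetic_intensity(flops, bytes_moved)


def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_flops, intensity * peak_bandwidth)


decode_like = matmul_intensity(1, 4096, 4096)      # one token against a weight matrix
prefill_like = matmul_intensity(2048, 4096, 4096)  # many tokens against the same matrix
```

On a hypothetical device with 100e12 FLOP/s of compute and 1e12 bytes/s of bandwidth, any kernel below 100 FLOPs per byte is memory-bound, which is why `decode_like` falls far under the roof while `prefill_like` can reach it.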
Collective operation
Definition: A communication primitive that involves a group of devices, such as all-reduce, all-gather, reduce-scatter, or broadcast.
A communication primitive that involves a group of devices, such as all-reduce, all-gather, reduce-scatter, or broadcast. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Kernel launch overhead
Definition: The fixed host-side and driver cost of dispatching a kernel to an accelerator, which can dominate when kernels are small.
The fixed host-side and driver cost of dispatching a kernel to an accelerator, which can dominate when kernels are small. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
CUDA graph
Definition: A CUDA feature that captures a sequence of kernels and memory operations as a graph and launches it as a unit, reducing per-kernel launch overhead.
A CUDA feature that captures a sequence of kernels and memory operations as a graph and launches it as a unit, reducing per-kernel launch overhead. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
LLM inference
Transformer inference
Definition: Running the forward pass of a trained transformer model to produce outputs from inputs.
Running the forward pass of a trained transformer model to produce outputs from inputs. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Prefill
Definition: The phase that processes all prompt tokens in parallel and populates the KV cache before the first output token is produced.
The phase that processes all prompt tokens in parallel and populates the KV cache before the first output token is produced. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Decode
Definition: The phase that generates output tokens one step at a time, reusing the KV cache built during prefill and earlier decode steps.
The phase that generates output tokens one step at a time, reusing the KV cache built during prefill and earlier decode steps. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Autoregressive generation
Definition: Generating a sequence token by token, with each new token conditioned on all previous tokens.
Generating a sequence token by token, with each new token conditioned on all previous tokens. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Attention kernel
Definition: A device kernel that implements the attention computation, often fused to avoid materializing the full attention matrix in memory.
A device kernel that implements the attention computation, often fused to avoid materializing the full attention matrix in memory. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
KV cache
Definition: Stored key and value tensors used to avoid recomputing attention over previous tokens.
Stored key and value tensors used to avoid recomputing attention over previous tokens. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
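A back-of-the-envelope estimate of KV cache size follows directly from this definition: one key and one value vector per layer, per KV head, per token. The sketch below assumes a dense layout with no paging overhead, and the model dimensions are hypothetical values chosen only for the arithmetic.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_element=2):
    """Approximate KV cache size; the leading 2 covers keys plus values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)


# Hypothetical model: 32 layers, head_dim 128, FP16 cache, 4096-token sequence.
full_heads = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
grouped_heads = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)
```

Cutting the number of KV heads from 32 to 8, as grouped-query attention does, shrinks the cache by the same factor of four.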
Context length
Definition: The maximum number of tokens, prompt plus generated output, that a model or runtime can attend to in one sequence.
The maximum number of tokens, prompt plus generated output, that a model or runtime can attend to in one sequence. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Streaming generation
Definition: Returning output tokens to the client incrementally as they are generated instead of waiting for the full response.
Returning output tokens to the client incrementally as they are generated instead of waiting for the full response. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
TTFT
Definition: Time to first token, a key interactive latency metric for streaming LLM responses.
Time to first token, a key interactive latency metric for streaming LLM responses. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
TPOT
Definition: Time per output token, a decode-phase latency metric.
Time per output token, a decode-phase latency metric. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
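Both TTFT and TPOT can be derived from client-side timestamps. The sketch below assumes one timestamp per received token and computes TPOT as the mean gap between consecutive tokens after the first; some benchmarks define TPOT slightly differently, so treat the formula as one common convention rather than a standard.

```python
def ttft(request_sent_at, token_times):
    """Time from sending the request to receiving the first token."""
    return token_times[0] - request_sent_at


def tpot(token_times):
    """Mean time per output token after the first; None if fewer than two tokens."""
    if len(token_times) < 2:
        return None
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


# Hypothetical timestamps in milliseconds.
sent_at = 0
arrivals = [500, 550, 600, 650]
first_token_latency = ttft(sent_at, arrivals)
per_token_latency = tpot(arrivals)
```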
Structured generation
Definition: Producing model output that conforms to a specified format such as a JSON schema or a formal grammar.
Producing model output that conforms to a specified format such as a JSON schema or a formal grammar. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Constrained decoding
Definition: Masking the token distribution at each decoding step so that only tokens permitted by a grammar, schema, or pattern can be selected.
Masking the token distribution at each decoding step so that only tokens permitted by a grammar, schema, or pattern can be selected. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
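The core step of constrained decoding can be sketched as a mask over the logits. The example below uses greedy selection and takes the set of allowed token ids as given; in a real system that set would come from a grammar or schema state machine, which is outside this sketch.

```python
import math


def constrained_argmax(logits, allowed_token_ids):
    """Pick the highest-scoring token among those the constraint permits."""
    if not allowed_token_ids:
        raise ValueError("constraint allows no tokens at this step")
    masked = [
        logit if token_id in allowed_token_ids else -math.inf
        for token_id, logit in enumerate(logits)
    ]
    return max(range(len(masked)), key=masked.__getitem__)


# Token 1 has the highest raw score but is not allowed at this step.
choice = constrained_argmax([2.0, 5.0, 1.0, 3.0], allowed_token_ids={0, 3})
```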
Speculative decoding
Definition: Using a faster draft path and target verification to reduce sequential generation steps.
Using a faster draft path and target verification to reduce sequential generation steps. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
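The verification step can be sketched for the greedy case, where a drafted token is accepted only if it equals the target model's own choice at that position. This is a simplification: sampling-based variants use a probabilistic accept/reject rule instead of exact match. The function name and the list-based interface are assumptions for illustration.

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy verification of a draft.

    target_tokens[i] is the target model's choice at position i given the
    prefix plus draft_tokens[:i], so it has len(draft_tokens) + 1 entries,
    all obtained from a single target forward pass.
    """
    accepted = []
    for drafted, expected in zip(draft_tokens, target_tokens):
        if drafted != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(drafted)
    # Every drafted token matched, so the extra target token comes for free.
    accepted.append(target_tokens[len(draft_tokens)])
    return accepted


partial = verify_draft([5, 7, 9], [5, 7, 2, 4])
full = verify_draft([5, 7, 9], [5, 7, 9, 4])
```

Either way the step emits at least one token, so output matches what the target model would have produced on its own under greedy decoding.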
Draft model
Definition: The smaller, faster model that proposes candidate tokens for the target model to verify in speculative decoding.
The smaller, faster model that proposes candidate tokens for the target model to verify in speculative decoding. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Acceptance rate
Definition: In speculative decoding, the fraction of drafted tokens that the target model accepts.
In speculative decoding, the fraction of drafted tokens that the target model accepts. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
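Acceptance rate largely determines how much speculative decoding helps. Under the simplifying assumption that each drafted token is accepted independently with probability `alpha`, the expected number of tokens emitted per target verification step with a draft of length `gamma` is a geometric sum. The sketch below computes it; real acceptance is position- and content-dependent, so this is an estimate only.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens per verification step: 1 + alpha + ... + alpha**gamma."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be a probability")
    if alpha == 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


low_acceptance = expected_tokens_per_step(alpha=0.0, gamma=3)
half_acceptance = expected_tokens_per_step(alpha=0.5, gamma=3)
perfect_acceptance = expected_tokens_per_step(alpha=1.0, gamma=3)
```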
Medusa
Definition: A speculative decoding method that adds extra decoding heads to the model to predict several future tokens, which are then verified together.
A speculative decoding method that adds extra decoding heads to the model to predict several future tokens, which are then verified together. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
EAGLE
Definition: A speculative decoding method that drafts at the level of the target model's hidden features using a lightweight autoregressive head.
A speculative decoding method that drafts at the level of the target model's hidden features using a lightweight autoregressive head. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Long-context runtime
Definition: A runtime designed or configured to serve very long sequences, where KV cache memory and attention cost dominate resource use.
A runtime designed or configured to serve very long sequences, where KV cache memory and attention cost dominate resource use. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Memory and KV cache
PagedAttention
Definition: A KV-cache paging technique that stores sequence blocks non-contiguously while preserving attention semantics.
A KV-cache paging technique that stores sequence blocks non-contiguously while preserving attention semantics. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Prefix caching
Definition: Reusing KV cache entries computed for a shared prompt prefix across requests so the prefix is not prefilled again.
Reusing KV cache entries computed for a shared prompt prefix across requests so the prefix is not prefilled again. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
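One way to detect a reusable prefix is to key each full block of tokens by a hash that also covers everything before it, so a block only matches when the whole preceding prefix matches. The sketch below illustrates that idea with Python's built-in `hash`; the function names and the block size are assumptions, and a production cache would need a collision-resistant key and would store the actual KV tensors.

```python
def block_keys(token_ids, block_size):
    """Chained keys for each full block; a key depends on all earlier tokens."""
    keys, parent = [], None
    full_length = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_length, block_size):
        block = tuple(token_ids[start:start + block_size])
        parent = hash((parent, block))
        keys.append(parent)
    return keys


def cached_prefix_tokens(token_ids, cache, block_size):
    """Number of leading tokens whose KV blocks are already in the cache."""
    matched = 0
    for key in block_keys(token_ids, block_size):
        if key not in cache:
            break
        matched += block_size
    return matched


# A first request populates the cache; a second shares its first four tokens.
cache = set(block_keys([1, 2, 3, 4, 5, 6], block_size=2))
reused = cached_prefix_tokens([1, 2, 3, 4, 9, 9, 7], cache, block_size=2)
```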
RadixAttention
Definition: An SGLang technique that organizes reusable prefixes in a radix tree for efficient KV reuse.
An SGLang technique that organizes reusable prefixes in a radix tree for efficient KV reuse. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Cache eviction
Definition: The policy that decides which cached KV blocks to discard when memory is needed, for example least recently used.
The policy that decides which cached KV blocks to discard when memory is needed, for example least recently used. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
KV offload
Definition: Moving KV cache blocks out of accelerator memory to host memory or storage so they can be restored later instead of recomputed.
Moving KV cache blocks out of accelerator memory to host memory or storage so they can be restored later instead of recomputed. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
KV transfer
Definition: Moving KV cache contents between workers or nodes, for example from a prefill worker to a decode worker in disaggregated serving.
Moving KV cache contents between workers or nodes, for example from a prefill worker to a decode worker in disaggregated serving. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Block table
Definition: The per-sequence mapping from logical KV block indices to physical blocks in a paged KV cache.
The per-sequence mapping from logical KV block indices to physical blocks in a paged KV cache. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
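A block table can be illustrated with a toy allocator that hands out fixed-size physical blocks on demand and records, per sequence, which physical block backs each logical block. The class and method names below are assumptions for this sketch and do not correspond to any particular runtime's API; only bookkeeping is modeled, not the KV tensors themselves.

```python
class PagedKVAllocator:
    """Toy paged KV allocator that tracks block tables only."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # current block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def locate(self, seq_id, token_index):
        """Translate a logical token position to (physical block, offset)."""
        table = self.block_tables[seq_id]
        return table[token_index // self.block_size], token_index % self.block_size

    def free(self, seq_id):
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


allocator = PagedKVAllocator(num_blocks=4, block_size=2)
for seq_id in ["a", "b", "a", "a", "b"]:  # interleaved growth of two sequences
    allocator.append_token(seq_id)
```

Because sequences grow in an interleaved order, the blocks of sequence `a` end up non-contiguous in physical memory, and the block table is what lets attention still address them in logical order.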
Internal fragmentation
Definition: Memory wasted inside an allocated block or region because the allocation is larger than what is actually used.
Memory wasted inside an allocated block or region because the allocation is larger than what is actually used. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
External fragmentation
Definition: Free memory split into non-contiguous pieces that are individually too small to satisfy a request, even though the total free amount would suffice.
Free memory split into non-contiguous pieces that are individually too small to satisfy a request, even though the total free amount would suffice. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
MQA
Definition: Multi-query attention, an attention variant in which all query heads share a single key/value head, shrinking the KV cache.
Multi-query attention, an attention variant in which all query heads share a single key/value head, shrinking the KV cache. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
GQA
Definition: Grouped-query attention, an attention variant in which query heads are divided into groups that each share one key/value head.
Grouped-query attention, an attention variant in which query heads are divided into groups that each share one key/value head. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Cache-aware routing
Definition: Routing each request to the replica most likely to already hold KV cache entries for its prefix.
Routing each request to the replica most likely to already hold KV cache entries for its prefix. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Prefix reuse
Definition: Serving the KV entries of a shared prompt prefix from cache instead of recomputing them, and the degree to which a workload allows this.
Serving the KV entries of a shared prompt prefix from cache instead of recomputing them, and the degree to which a workload allows this. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Memory pressure
Definition: A condition in which demand for accelerator memory approaches capacity, forcing eviction, preemption, offload, or request rejection.
A condition in which demand for accelerator memory approaches capacity, forcing eviction, preemption, offload, or request rejection. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Observability and benchmarking
Runtime trace
Definition: A structured record of runtime spans, events, decisions, timing, costs, and references.
A structured record of runtime spans, events, decisions, timing, costs, and references. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Span
Definition: A timed unit of work within a trace, with a start, an end, attributes, and a link to its parent.
A timed unit of work within a trace, with a start, an end, attributes, and a link to its parent. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Replay
Definition: Re-executing a recorded request or trace with its recorded inputs to reproduce, debug, or compare behavior.
Re-executing a recorded request or trace with its recorded inputs to reproduce, debug, or compare behavior. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Trace waterfall
Definition: A timeline view of a trace that shows each span's start, duration, and nesting, making it easy to see where time was spent.
A timeline view of a trace that shows each span's start, duration, and nesting, making it easy to see where time was spent. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Queue time
Definition: The time a request spends waiting before execution starts.
The time a request spends waiting before execution starts. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Model time
Definition: The portion of request latency spent in model inference, as opposed to queueing, tools, or retrieval.
The portion of request latency spent in model inference, as opposed to queueing, tools, or retrieval. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Tool time
Definition: The portion of request latency spent executing tool calls.
The portion of request latency spent executing tool calls. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Retrieval time
Definition: The portion of request latency spent fetching context from retrieval systems such as search indexes or vector stores.
The portion of request latency spent fetching context from retrieval systems such as search indexes or vector stores. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Token count
Definition: The number of input and output tokens processed for a request, a primary driver of latency and cost.
The number of input and output tokens processed for a request, a primary driver of latency and cost. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Tail latency
Definition: Latency at high percentiles such as p95 or p99, describing the slowest requests rather than the typical one.
Latency at high percentiles such as p95 or p99, describing the slowest requests rather than the typical one. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
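Tail latency is usually reported as a percentile over recorded samples. The sketch below uses the nearest-rank method, which is one of several percentile definitions; monitoring systems often interpolate or use histogram buckets instead, so results can differ slightly between tools.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = list(range(1, 101))  # hypothetical samples: 1 ms through 100 ms
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```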
Throughput
Definition: The amount of work completed per unit of time, such as requests per second or tokens per second.
The amount of work completed per unit of time, such as requests per second or tokens per second. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Cost per successful task
Definition: Total cost divided by the number of tasks completed successfully, so that failed attempts and retries count toward the cost.
Total cost divided by the number of tasks completed successfully, so that failed attempts and retries count toward the cost. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
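The metric is simple to compute once each attempt carries a cost and a success flag. The record fields below (`cost_cents`, `success`) are hypothetical names chosen for this sketch.

```python
def cost_per_successful_task(attempts):
    """Total cost over all attempts divided by the number of successes."""
    total_cost = sum(attempt["cost_cents"] for attempt in attempts)
    successes = sum(1 for attempt in attempts if attempt["success"])
    if successes == 0:
        return None  # undefined when nothing succeeded
    return total_cost / successes


attempts = [
    {"cost_cents": 2, "success": True},
    {"cost_cents": 3, "success": False},  # failed work still costs money
    {"cost_cents": 5, "success": True},
]
result = cost_per_successful_task(attempts)
```

The failed attempt raises the figure above the plain average cost per attempt, which is the point of the metric.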
Benchmark fixture
Definition: A fixed, versioned set of inputs, configuration, and environment details that makes benchmark runs repeatable and comparable.
A fixed, versioned set of inputs, configuration, and environment details that makes benchmark runs repeatable and comparable. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Warmup window
Definition: The initial period of a benchmark run that is excluded from results so that caches, compilation, and clocks can stabilize.
The initial period of a benchmark run that is excluded from results so that caches, compilation, and clocks can stabilize. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Measurement window
Definition: The period of a benchmark run, after warmup, over which metrics are recorded.
The period of a benchmark run, after warmup, over which metrics are recorded. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Power per inference
Definition: The energy cost attributed to a single inference, commonly derived from average power draw and per-inference latency.
The energy cost attributed to a single inference, commonly derived from average power draw and per-inference latency. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Quality regression
Definition: A drop in output quality after a change such as quantization, a runtime upgrade, or a configuration update.
A drop in output quality after a change such as quantization, a runtime upgrade, or a configuration update. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Runtime fundamentals
AI runtime
Definition: The execution environment that turns model artifacts and requests into reliable AI-enabled behavior.
The execution environment that turns model artifacts and requests into reliable AI-enabled behavior. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Inference runtime
Definition: The software that executes a trained model's forward pass to serve predictions, as opposed to training it.
The software that executes a trained model's forward pass to serve predictions, as opposed to training it. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Inference engine
Definition: A system optimized to load trained models and run inference on one or more hardware targets.
A system optimized to load trained models and run inference on one or more hardware targets. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Model server
Definition: A service layer that exposes models through APIs and manages scheduling, repositories, health, and batching.
A service layer that exposes models through APIs and manages scheduling, repositories, health, and batching. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Serving platform
Definition: A layer above model servers that handles deployment, autoscaling, routing, and rollout across a fleet.
A layer above model servers that handles deployment, autoscaling, routing, and rollout across a fleet. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Compiler runtime
Definition: The runtime component that loads and executes artifacts produced by an ML compiler.
The runtime component that loads and executes artifacts produced by an ML compiler. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Graph runtime
Definition: A runtime that executes a model represented as a graph of operators, scheduling nodes onto the available devices.
A runtime that executes a model represented as a graph of operators, scheduling nodes onto the available devices. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Framework runtime
Definition: The execution layer built into a machine-learning framework such as PyTorch or TensorFlow.
The execution layer built into a machine-learning framework such as PyTorch or TensorFlow. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Runtime infrastructure
Definition: The surrounding compute, networking, storage, and orchestration that AI runtimes depend on to operate.
The surrounding compute, networking, storage, and orchestration that AI runtimes depend on to operate. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Adapter
Definition: A component that connects a runtime to another interface or backend; in model contexts, also a small set of add-on weights, such as LoRA, loaded on top of a base model.
A component that connects a runtime to another interface or backend; in model contexts, also a small set of add-on weights, such as LoRA, loaded on top of a base model. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Execution boundary
Definition: The point where control or data crosses between processes, devices, runtimes, or trust domains.
The point where control or data crosses between processes, devices, runtimes, or trust domains. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Model lifecycle
Definition: The stages a model passes through inside a runtime, such as load, warm up, serve, update, and unload.
The stages a model passes through inside a runtime, such as load, warm up, serve, update, and unload. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Request lifecycle
Definition: The stages a request passes through inside a runtime, such as admission, queueing, execution, streaming, and completion or cancellation.
The stages a request passes through inside a runtime, such as admission, queueing, execution, streaming, and completion or cancellation. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Scheduling and batching
Static batching
Definition: Grouping a fixed set of requests into a batch that runs to completion together before the next batch starts.
Grouping a fixed set of requests into a batch that runs to completion together before the next batch starts. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Dynamic batching
Definition: Forming batches at run time from requests that arrive close together, bounded by a maximum batch size and a maximum queueing delay.
Forming batches at run time from requests that arrive close together, bounded by a maximum batch size and a maximum queueing delay. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
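The two bounds that govern dynamic batching, batch size and queueing delay, can be shown with an offline sketch that groups a list of timestamped arrivals. The function name, the tuple layout, and the time units are assumptions for illustration; a live server would do this with timers rather than over a precomputed list.

```python
def form_batches(arrivals, max_batch_size, max_delay):
    """Group (arrival_time, request_id) pairs, sorted by time, into batches.

    A batch is dispatched when it is full or when max_delay has passed
    since its first request arrived. Returns (dispatch_time, [ids]) pairs.
    """
    batches = []
    current, opened_at = [], None
    for arrival_time, request_id in arrivals:
        if current and arrival_time - opened_at >= max_delay:
            batches.append((opened_at + max_delay, current))  # delay bound hit
            current, opened_at = [], None
        if not current:
            opened_at = arrival_time
        current.append(request_id)
        if len(current) == max_batch_size:
            batches.append((arrival_time, current))  # size bound hit
            current, opened_at = [], None
    if current:
        batches.append((opened_at + max_delay, current))
    return batches


arrivals = [(0, "a"), (1, "b"), (2, "c"), (10, "d"), (30, "e")]
batches = form_batches(arrivals, max_batch_size=3, max_delay=5)
```

The first batch leaves early because it fills up, while the later requests wait out the full delay alone, which is the latency versus utilization trade-off the two bounds control.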
Continuous batching
Definition: Allowing new requests to enter an active LLM decoding batch as others finish.
Allowing new requests to enter an active LLM decoding batch as others finish. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
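A small step-by-step simulation shows the effect. The sketch below assumes every running request produces one token per decode step and that a freed slot is refilled at the start of the next step; the function name and the request tuples are hypothetical.

```python
from collections import deque


def continuous_batching_steps(requests, max_batch_size):
    """Simulate decode steps; returns {request_id: step at which it finished}.

    requests is a list of (request_id, tokens_to_generate) in arrival order.
    """
    waiting = deque(requests)
    running = {}
    finished_at = {}
    step = 0
    while waiting or running:
        # Fill free slots before each step instead of waiting for the batch to drain.
        while waiting and len(running) < max_batch_size:
            request_id, remaining = waiting.popleft()
            running[request_id] = remaining
        step += 1
        for request_id in list(running):
            running[request_id] -= 1
            if running[request_id] <= 0:
                del running[request_id]
                finished_at[request_id] = step
    return finished_at


finished = continuous_batching_steps([("a", 1), ("b", 4), ("c", 2)], max_batch_size=2)
```

Request `c` takes over the slot freed by `a` after the first step and finishes at step 3; with static batching of the same inputs it could not start until `b` finished at step 4.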
In-flight batching
Definition: Another name for continuous batching, used notably by NVIDIA TensorRT-LLM.
Another name for continuous batching, used notably by NVIDIA TensorRT-LLM. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Chunked prefill
Definition: Splitting the prefill of a long prompt into chunks that are scheduled alongside decode steps, so one long prompt does not stall other requests.
Splitting the prefill of a long prompt into chunks that are scheduled alongside decode steps, so one long prompt does not stall other requests. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Admission control
Definition: Deciding on arrival whether to accept, queue, or reject a request based on current capacity.
Deciding on arrival whether to accept, queue, or reject a request based on current capacity. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
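One simple form of admission control tracks a budget of tokens in flight and rejects requests that would exceed it. The sketch below assumes the worst case, reserving the prompt length plus the maximum output length up front; the class name and the budget figure are illustrative only.

```python
class AdmissionController:
    """Admit a request only if its worst-case token footprint fits the budget."""

    def __init__(self, token_budget):
        self.token_budget = token_budget
        self.tokens_in_flight = 0

    def try_admit(self, prompt_tokens, max_new_tokens):
        needed = prompt_tokens + max_new_tokens
        if self.tokens_in_flight + needed > self.token_budget:
            return False  # caller may queue, retry later, or reject
        self.tokens_in_flight += needed
        return True

    def release(self, prompt_tokens, max_new_tokens):
        """Return the reservation when the request completes or is cancelled."""
        self.tokens_in_flight -= prompt_tokens + max_new_tokens


controller = AdmissionController(token_budget=1000)
first = controller.try_admit(prompt_tokens=400, max_new_tokens=200)
second = controller.try_admit(prompt_tokens=300, max_new_tokens=200)
controller.release(400, 200)
third = controller.try_admit(prompt_tokens=300, max_new_tokens=200)
```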
SLO-aware scheduling
Definition: Ordering and batching work according to each request's latency targets rather than arrival order alone.
Ordering and batching work according to each request's latency targets rather than arrival order alone. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Priority queue
Definition: A queue that serves requests in order of priority rather than strictly in arrival order.
A queue that serves requests in order of priority rather than strictly in arrival order. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Fairness
Definition: Sharing capacity so that no tenant or request class is starved by others.
Sharing capacity so that no tenant or request class is starved by others. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Backpressure
Definition: Signaling upstream components to slow down or stop sending work when a downstream stage is saturated.
Signaling upstream components to slow down or stop sending work when a downstream stage is saturated. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Cancellation
Definition: Stopping work on a request that is no longer needed and releasing the resources it holds, such as batch slots and KV cache blocks.
Stopping work on a request that is no longer needed and releasing the resources it holds, such as batch slots and KV cache blocks. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Head-of-line blocking
Definition: A situation in which a slow or long request at the front of a queue delays the requests behind it.
A situation in which a slow or long request at the front of a queue delays the requests behind it. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Multi-tenant quota
Definition: A per-tenant limit on requests, tokens, or other resources that keeps one tenant from exhausting shared capacity.
A per-tenant limit on requests, tokens, or other resources that keeps one tenant from exhausting shared capacity. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
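A token bucket is one common way to enforce a per-tenant quota; it is shown here as an example technique, not as the mechanism any specific runtime uses. The sketch passes the current time in explicitly so the behavior is deterministic, and the capacity and refill rate are hypothetical.

```python
class TokenBucket:
    """Per-tenant quota: spend from a bucket that refills at a steady rate."""

    def __init__(self, capacity, refill_per_second):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity
        self.updated_at = 0.0

    def try_consume(self, amount, now):
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if amount > self.tokens:
            return False  # over quota; the tenant must wait for a refill
        self.tokens -= amount
        return True


# One bucket per tenant; the key is a hypothetical tenant id.
quotas = {"tenant-a": TokenBucket(capacity=100, refill_per_second=10)}
bucket = quotas["tenant-a"]
results = [
    bucket.try_consume(80, now=0.0),  # fits in the initial 100
    bucket.try_consume(50, now=1.0),  # only 30 available after one second
    bucket.try_consume(50, now=3.0),  # 50 available after two more seconds
]
```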
Security and governance
Prompt injection
Definition: Untrusted content attempting to override instructions, leak data, or induce unsafe tool behavior.
Untrusted content attempting to override instructions, leak data, or induce unsafe tool behavior. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Confused deputy
Definition: A failure where the model or runtime uses its authority on behalf of an untrusted input.
A failure where the model or runtime uses its authority on behalf of an untrusted input. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
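One common mitigation for the confused deputy problem is to authorize each tool call against the authority of the principal the runtime is acting for, not only the runtime's own. The sketch below assumes a simple scope model; `AGENT_SCOPES`, `may_call_tool`, and the scope strings are hypothetical.

```python
from typing import Set

# Scopes held by the agent itself (hypothetical values for illustration).
AGENT_SCOPES: Set[str] = {"files:read", "files:delete"}


def may_call_tool(required_scope: str, user_scopes: Set[str]) -> bool:
    """Permit a call only if both the agent and the requesting user hold the scope."""
    # Checking the user's scopes prevents the agent from lending its own
    # broader authority to a request that originated from a weaker principal.
    return required_scope in AGENT_SCOPES and required_scope in user_scopes


viewer_scopes = {"files:read"}
can_read = may_call_tool("files:read", viewer_scopes)
can_delete = may_call_tool("files:delete", viewer_scopes)
```

Here the agent could delete files on its own authority, but the call is refused because the user it acts for cannot.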
Policy decision point
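Definition: The component that evaluates a request against policy and returns an authorization decision, such as permit or deny.
The term comes from access-control architectures such as XACML. In AI runtimes, a policy decision point typically considers who is asking, which tool or data is involved, and the surrounding context; it decides but does not itself block or allow the action. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.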
Policy enforcement point
Definition: The component that intercepts a request, obtains a decision from a policy decision point, and then allows or blocks the action accordingly.
In AI runtimes, enforcement points commonly sit in front of tool calls, data retrieval, and outbound responses, so that no guarded action runs without a decision. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
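The split between a policy decision point and a policy enforcement point can be sketched as two functions: one that only decides, and one that wraps the guarded action and acts on the decision. The rules, field names, and the `decide` and `enforce` functions below are hypothetical and deliberately tiny.

```python
from typing import Callable, Dict


def decide(request: Dict[str, str]) -> str:
    """Decision point: maps request attributes to "permit" or "deny"."""
    if request["action"] == "read":
        return "permit"
    if request["action"] == "write" and request["role"] == "admin":
        return "permit"
    return "deny"


def enforce(request: Dict[str, str], action: Callable[[], str]) -> str:
    """Enforcement point: runs the action only when the decision is "permit"."""
    if decide(request) != "permit":
        raise PermissionError(f"denied: {request['role']} may not {request['action']}")
    return action()


allowed = enforce({"role": "viewer", "action": "read"}, lambda: "file contents")

try:
    enforce({"role": "viewer", "action": "write"}, lambda: "written")
    blocked = False
except PermissionError:
    blocked = True
```

Keeping `decide` free of side effects makes the policy easy to test on its own, while `enforce` stays the single place where a guarded action can run.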
Redaction
Definition: The removal or masking of sensitive content, such as personal data or credentials, before it is stored, logged, displayed, or sent to a model or tool.
In AI runtimes, redaction is commonly applied to prompts, retrieved context, tool outputs, and traces. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
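A simple form of redaction is pattern-based masking applied before text is logged or forwarded. The sketch below assumes email addresses are the only sensitive pattern and uses a deliberately loose regular expression; production redaction usually combines several detectors. The `redact` function and placeholder string are hypothetical.

```python
import re

# Loose email pattern for illustration; it is not a full address validator.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


masked = redact("Contact alice@example.com for access.")
```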
Data classification
Definition: The labeling of data by sensitivity or handling requirements so that policy can be applied consistently.
In AI runtimes, labels such as public, internal, or confidential can drive decisions about which models, tools, tenants, and logs may receive a given piece of data. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Model artifact integrity
Definition: Assurance that model weights, configuration, and related files are exactly what the publisher produced and have not been altered or corrupted.
Integrity is commonly established by checking cryptographic hashes or signatures before an artifact is loaded. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
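One basic integrity check is to compare a file's SHA-256 digest with a value published out of band before loading it. The sketch below assumes such a digest is available; `sha256_of` and `verify_artifact` are hypothetical helper names, and signature verification is out of scope here.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash the file in chunks so large weight files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_hex: str) -> bool:
    """Return True only if the file's digest matches the expected value."""
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(sha256_of(path), expected_hex.lower())
```

A loader would call `verify_artifact` first and refuse to deserialize the file when it returns `False`.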
Secrets handling
Definition: The practices for storing, injecting, using, and rotating credentials such as API keys and tokens without exposing them.
In AI runtimes, this usually means keeping secrets out of prompts, model context, logs, and traces, and supplying them to tools at call time with the narrowest scope needed. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Sandboxing
Definition: Running code or tools in a restricted environment that limits access to the file system, network, processes, and other resources.
In AI runtimes, sandboxing is typically applied to model-generated code and tool execution so that a faulty or manipulated action can do only limited damage. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Tenant isolation
Definition: The separation of each tenant's data, state, and resources so that one tenant cannot read or affect another's.
In AI runtimes, isolation commonly has to cover caches, memory stores, logs, and credentials, as well as performance, so that shared infrastructure does not leak data or capacity across tenants. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Audit trail
Definition: A chronological record of who did what, when, and under which authority, kept so that it can be reviewed later and is hard to alter unnoticed.
In AI runtimes, an audit trail typically records model calls, tool invocations, policy decisions, and approvals so that behavior can be reconstructed afterwards. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
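One way to make an audit trail tamper-evident is to chain entries by hash, so that changing an earlier entry invalidates every later one. The sketch below assumes an in-memory list and JSON-serializable events; `append_entry`, `verify`, and the event fields are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _entry_hash(prev_hash: str, event: Dict[str, Any]) -> str:
    # Sorting keys gives a stable serialization for hashing.
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log: List[Dict[str, Any]], event: Dict[str, Any]) -> None:
    """Append an event, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _entry_hash(prev_hash, event)})


def verify(log: List[Dict[str, Any]]) -> bool:
    """Recompute the chain and report whether every link still matches."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


log: List[Dict[str, Any]] = []
append_entry(log, {"actor": "agent-1", "action": "tool_call", "tool": "search"})
append_entry(log, {"actor": "user-7", "action": "approve", "target": "send_email"})
intact = verify(log)

log[0]["event"]["tool"] = "delete_files"   # simulate tampering with history
tamper_detected = not verify(log)
```

A hash chain only detects edits; it does not prevent them, so real deployments usually also store the log, or its latest hash, somewhere the writer cannot modify.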
Provenance
Definition: The record of where a piece of data, a model artifact, or an output came from and how it was produced.
In AI runtimes, provenance links outputs to the source documents, model versions, prompts, and tool results that contributed to them. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Policy checkpoint
Definition: A defined point in an execution flow at which policy is evaluated before the work continues.
Checkpoints are commonly placed before tool calls, before side effects, and before outputs are returned, so that a run can be stopped or redirected at known boundaries. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
Approval gate
Definition: A control that pauses an action until a human or an authorized system approves it.
Approval gates are typically applied to high-risk or irreversible side effects, such as payments, deletions, or external messages. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.
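An approval gate can be sketched as a holding area for deferred actions: low-risk actions run immediately, while flagged ones wait until someone approves or rejects them. The `ApprovalGate` class, its methods, and the `needs_approval` flag below are hypothetical; how an action gets flagged is left to policy.

```python
from typing import Callable, Dict, Optional


class ApprovalGate:
    """Holds flagged actions until they are explicitly approved or rejected."""

    def __init__(self) -> None:
        self._pending: Dict[str, Callable[[], str]] = {}

    def request(
        self, action_id: str, action: Callable[[], str], needs_approval: bool
    ) -> Optional[str]:
        # Unflagged actions run right away; flagged ones are parked.
        if not needs_approval:
            return action()
        self._pending[action_id] = action
        return None

    def approve(self, action_id: str) -> str:
        # Run the parked action exactly once; unknown ids raise KeyError.
        return self._pending.pop(action_id)()

    def reject(self, action_id: str) -> None:
        # Drop the parked action without running it.
        self._pending.pop(action_id)


gate = ApprovalGate()
immediate = gate.request("a1", lambda: "looked up", needs_approval=False)
held = gate.request("a2", lambda: "email sent", needs_approval=True)
released = gate.approve("a2")
```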
Sources and further reading
- ONNX Runtime Architecture — ONNX Runtime; official docs; accessed 2026-06-21 UTC.
- StableHLO Specification — OpenXLA; official specification; accessed 2026-06-21 UTC.
- vLLM Documentation — vLLM; official docs; accessed 2026-06-21 UTC.
- RadixAttention – SGLang — SGLang; official docs; accessed 2026-06-21 UTC.
- NVIDIA Triton Inference Server Architecture — NVIDIA; official docs; accessed 2026-06-21 UTC.
- Web Neural Network API — W3C; standard; accessed 2026-06-21 UTC.
- ExecuTorch Documentation — PyTorch; official docs; accessed 2026-06-21 UTC.
- LiteRT Documentation — Google AI Edge; official docs; accessed 2026-06-21 UTC.
