AI Runtime Glossary

A searchable glossary of AI runtime terminology across compilers, inference engines, serving systems, edge/browser runtimes, agentic runtimes, security, and observability.

Audience: Technical readers
Reading time: 55 minutes
Status: Foundational
Last reviewed:

The aRuntime glossary defines runtime terms across compiler pipelines, graph execution, LLM serving, edge/browser deployment, agentic systems, governance, and benchmarking.

Agentic runtime

Agentic runtime

Definition: A production execution layer for agentic work that coordinates context, tools, memory, policy, state, and traceability.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Agentic runtime

Related: Tool broker, Tool contract, Side-effect classification

Tool broker

Definition: An agentic runtime concept used when designing, implementing, or operating AI runtimes.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Agentic runtime

Related: Agentic runtime, Tool contract, Side-effect classification

Tool contract

Definition: An agentic runtime concept used when designing, implementing, or operating AI runtimes.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Agentic runtime

Related: Agentic runtime, Tool broker, Side-effect classification

Side-effect classification

Definition: An agentic runtime concept used when designing, implementing, or operating AI runtimes.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Agentic runtime

Related: Agentic runtime, Tool broker, Tool contract

Idempotency

Definition: The property that performing an operation more than once has the same effect as performing it once, which makes retried tool calls safe.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Agentic runtime

Related: Agentic runtime, Tool broker, Tool contract
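
Example: one common way to make a side-effecting tool call safe to retry is to store its result under a caller-supplied idempotency key. The sketch below is a minimal in-memory illustration only; IdempotentExecutor, send_email, and the key format are hypothetical names chosen for this example, not part of any aRuntime interface.

```python
from typing import Any, Callable, Dict


class IdempotentExecutor:
    """Runs a side-effecting call at most once per idempotency key."""

    def __init__(self) -> None:
        # Maps idempotency key -> result of the first successful execution.
        self._results: Dict[str, Any] = {}

    def run(self, key: str, action: Callable[[], Any]) -> Any:
        # A retry with a known key returns the stored result
        # instead of repeating the side effect.
        if key in self._results:
            return self._results[key]
        result = action()
        self._results[key] = result
        return result


sent = []  # stands in for an external side effect (hypothetical)


def send_email() -> str:
    sent.append("welcome")
    return f"message-{len(sent)}"


executor = IdempotentExecutor()
first = executor.run("user-42:welcome", send_email)
retry = executor.run("user-42:welcome", send_email)  # no second email
```

A production version would typically persist the key store and handle concurrent retries; both are left out here to keep the idea visible.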

Memory manager

Definition: An agentic runtime concept used when designing, implementing, or operating AI runtimes.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Agentic runtime

Related: Agentic runtime, Tool broker, Tool contract

Working memory

Definition: Short-lived state that an agent keeps for the current task or session, such as intermediate results and recent conversation turns.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Agentic runtime

Related: Agentic runtime, Tool broker, Tool contract

Long-term memory

Definition: State that persists across tasks or sessions and is retrieved into context when relevant.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Agentic runtime

Related: Agentic runtime, Tool broker, Tool contract

Context assembly

Definition: An agentic runtime concept used when designing, implementing, or operating AI runtimes.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Agentic runtime

Related: Agentic runtime, Tool broker, Tool contract

Human review

Definition: A checkpoint at which a person inspects a proposed agent action or output and approves, edits, or rejects it before it takes effect.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Agentic runtime

Related: Agentic runtime, Tool broker, Tool contract

Rollback

Definition: Restoring a system or workflow to a previous known-good state after a failed or unwanted change.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Agentic runtime

Related: Agentic runtime, Tool broker, Tool contract

Compensation

Definition: A follow-up action that semantically undoes the effect of a step that has already completed and cannot simply be rolled back, such as issuing a refund after a charge.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Agentic runtime

Related: Agentic runtime, Tool broker, Tool contract
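
Example: compensation is often sketched as a list of (action, compensation) pairs, where a failure triggers the compensations of the completed steps in reverse order. The code below is a minimal illustration under that assumption; run_with_compensation and the booking steps are hypothetical names.

```python
from typing import Callable, List, Tuple

# Each step pairs an action with the compensation that undoes it.
Step = Tuple[Callable[[], None], Callable[[], None]]


def run_with_compensation(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


log: List[str] = []


def fail() -> None:
    raise RuntimeError("payment declined")


ok = run_with_compensation([
    (lambda: log.append("reserve seat"), lambda: log.append("release seat")),
    (lambda: log.append("hold funds"), lambda: log.append("release funds")),
    (fail, lambda: log.append("never runs")),
])
```

The third step fails, so the two completed steps are compensated newest first; the failed step's own compensation is never invoked because the step never completed.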

Evaluation envelope

Definition: An agentic runtime concept used when designing, implementing, or operating AI runtimes.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Agentic runtime

Related: Agentic runtime, Tool broker, Tool contract

Runtime contract

Definition: An agentic runtime concept used when designing, implementing, or operating AI runtimes.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Agentic runtime

Related: Agentic runtime, Tool broker, Tool contract

Compiler and IR terms

Intermediate representation

Definition: A representation of a program or model used inside a compiler, between the source form and the target code, on which analyses and transformations operate.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Compiler and IR terms

Related: ONNX, StableHLO, MLIR

ONNX

Definition: An open model graph format used to exchange models between frameworks and inference runtimes.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Compiler and IR terms

Related: Intermediate representation, StableHLO, MLIR

StableHLO

Definition: A portable high-level operation set used in compiler workflows around OpenXLA-compatible systems.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Compiler and IR terms

Related: Intermediate representation, ONNX, MLIR

MLIR

Definition: Multi-Level Intermediate Representation: a compiler infrastructure in the LLVM project for defining and transforming intermediate representations at multiple abstraction levels through extensible dialects.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Compiler and IR terms

Related: Intermediate representation, ONNX, StableHLO

TOSA

Definition: Tensor Operator Set Architecture: a specification of tensor operators commonly used in deep neural networks, designed so that implementations behave consistently across hardware.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Compiler and IR terms

Related: Intermediate representation, ONNX, StableHLO

HLO

Definition: High Level Operations: the operation set and intermediate representation used by the XLA compiler.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Compiler and IR terms

Related: Intermediate representation, ONNX, StableHLO

PTX

Definition: Parallel Thread Execution: NVIDIA's virtual instruction set for GPU code, which is compiled to the machine code of a specific GPU.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Compiler and IR terms

Related: Intermediate representation, ONNX, StableHLO

SPIR-V

Definition: A binary intermediate representation from the Khronos Group for shaders and compute kernels, consumed by APIs such as Vulkan and OpenCL.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Compiler and IR terms

Related: Intermediate representation, ONNX, StableHLO

Tracing

Definition: Capturing a model as a graph by running it on example inputs and recording the operations that execute; control-flow branches not taken during that run are not captured.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Compiler and IR terms

Related: Intermediate representation, ONNX, StableHLO

Scripting

Definition: Capturing a model by compiling its source code, including control flow, into an intermediate representation rather than recording a single execution, as in TorchScript.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Compiler and IR terms

Related: Intermediate representation, ONNX, StableHLO

JIT compilation

Definition: Compiling code at run time, typically on first execution, so the compiler can specialize for the inputs and hardware actually observed.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Compiler and IR terms

Related: Intermediate representation, ONNX, StableHLO

AOT compilation

Definition: Ahead-of-time compilation: compiling code before deployment, so that compilation cost is not paid at run time.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Compiler and IR terms

Related: Intermediate representation, ONNX, StableHLO

Lowering

Definition: Translating a program from a higher-level representation into a lower-level one that is closer to the target hardware.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Compiler and IR terms

Related: Intermediate representation, ONNX, StableHLO

Code generation

Definition: The compiler stage that emits target-specific code, such as machine instructions or GPU kernels, from an intermediate representation.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Compiler and IR terms

Related: Intermediate representation, ONNX, StableHLO

Shape guard

Definition: A runtime check that an input's tensor shape matches the assumptions under which code was compiled; if the check fails, the runtime recompiles or falls back.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Compiler and IR terms

Related: Intermediate representation, ONNX, StableHLO
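
Example: a shape guard can be pictured as a cache lookup keyed on input shape, with a miss triggering a fresh specialization. The sketch below uses plain Python lists and their length as a stand-in for tensor shape; specialize_sum and GuardedFunction are hypothetical names, and no real compiler is involved.

```python
from typing import Callable, Dict, List


def specialize_sum(length: int) -> Callable[[List[float]], float]:
    """Stand-in for a compiler: returns a function specialized to one length."""
    def kernel(xs: List[float]) -> float:
        total = 0.0
        for i in range(length):  # loop bound fixed at "compile" time
            total += xs[i]
        return total
    return kernel


class GuardedFunction:
    def __init__(self) -> None:
        self._kernels: Dict[int, Callable[[List[float]], float]] = {}
        self.compile_count = 0

    def __call__(self, xs: List[float]) -> float:
        n = len(xs)
        # Shape guard: reuse a kernel only if it was specialized for this length.
        if n not in self._kernels:
            self._kernels[n] = specialize_sum(n)
            self.compile_count += 1
        return self._kernels[n](xs)


f = GuardedFunction()
a = f([1.0, 2.0, 3.0])  # guard miss: specialize for length 3
b = f([4.0, 5.0, 6.0])  # guard hit: reuse the length-3 kernel
c = f([1.0, 2.0])       # guard miss: specialize for length 2
```

Three calls cause two specializations, which is the cost that frequent shape changes impose on a guarded, shape-specialized runtime.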

Dynamic shape

Definition: A tensor dimension whose size is not known until run time, such as batch size or sequence length.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Compiler and IR terms

Related: Intermediate representation, ONNX, StableHLO

Graph partitioning

Definition: Splitting a model graph into subgraphs that are assigned to different backends, devices, or compilers.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Compiler and IR terms

Related: Intermediate representation, ONNX, StableHLO

Execution provider

Definition: A backend abstraction used to run supported graph partitions on specific hardware or libraries.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Compiler and IR terms

Related: Intermediate representation, ONNX, StableHLO

Delegate

Definition: A mobile or edge backend that accelerates supported operations on a target processor.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Compiler and IR terms

Related: Intermediate representation, ONNX, StableHLO

BYOC

Definition: Bring Your Own Codegen: an Apache TVM mechanism for offloading supported subgraphs to an external compiler or library.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Compiler and IR terms

Related: Intermediate representation, ONNX, StableHLO

Deployment

Model repository

Definition: A storage location that holds model artifacts, configuration, and versions for a serving system to load.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Deployment

Related: Model versioning, Canary rollout, Blue-green deployment

Model versioning

Definition: Assigning distinct identifiers to successive model artifacts so that deployments can be reproduced, compared, and reverted.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Deployment

Related: Model repository, Canary rollout, Blue-green deployment

Canary rollout

Definition: Releasing a new version to a small share of traffic first and widening that share only if its metrics stay healthy.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Deployment

Related: Model repository, Model versioning, Blue-green deployment
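
Example: a canary split is often implemented by hashing a stable request or user identifier into a bucket, so the same caller consistently lands on the same version. The sketch below is one illustrative approach; the route function, the "stable" and "canary" labels, and the 5 percent share are assumptions made for this example.

```python
import hashlib


def route(request_id: str, canary_percent: int) -> str:
    """Return 'canary' for roughly canary_percent of request IDs, else 'stable'."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in [0, 100)
    return "canary" if bucket < canary_percent else "stable"


# Simulate 10,000 requests with a 5 percent canary share.
assignments = [route(f"req-{i}", canary_percent=5) for i in range(10_000)]
canary_share = assignments.count("canary") / len(assignments)
```

Because the bucket depends only on the identifier, raising canary_percent moves additional callers to the canary without reshuffling the ones already on it.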

Blue-green deployment

Definition: Running two complete environments, one live and one idle, and releasing by switching traffic from the old environment to the new one.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Deployment

Related: Model repository, Model versioning, Canary rollout

Autoscaling

Definition: Automatically adding or removing replicas in response to load or other metrics.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Deployment

Related: Model repository, Model versioning, Canary rollout

Cold start

Definition: The extra latency of the first request served by a newly started instance, which must load model weights and initialize the runtime.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Deployment

Related: Model repository, Model versioning, Canary rollout

Warmup

Definition: Running representative requests through a new instance before it takes traffic, so that lazy initialization, compilation, and cache population happen ahead of real requests.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Deployment

Related: Model repository, Model versioning, Canary rollout

Scale to zero

Definition: Removing all replicas of a service while it is idle, at the cost of a cold start when the next request arrives.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Deployment

Related: Model repository, Model versioning, Canary rollout

Kubernetes

Definition: An open-source system for deploying, scaling, and managing containerized workloads.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Deployment

Related: Model repository, Model versioning, Canary rollout

Serverless runtime

Definition: An execution environment in which the platform provisions and scales compute on demand, and the operator does not manage servers.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Deployment

Related: Model repository, Model versioning, Canary rollout

MicroVM

Definition: A lightweight virtual machine with a minimal device model that starts quickly while keeping hardware-level isolation.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Deployment

Related: Model repository, Model versioning, Canary rollout

Private cloud runtime

Definition: A runtime deployed on infrastructure dedicated to a single organization.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Deployment

Related: Model repository, Model versioning, Canary rollout

Managed cloud runtime

Definition: A runtime operated by a cloud provider, which handles provisioning, scaling, and maintenance.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Deployment

Related: Model repository, Model versioning, Canary rollout

Air-gapped runtime

Definition: A runtime deployed in an environment that has no network connection to external networks.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Deployment

Related: Model repository, Model versioning, Canary rollout

Hybrid runtime

Definition: A runtime that splits work across more than one environment, such as on-premises and cloud, or device and cloud.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Deployment

Related: Model repository, Model versioning, Canary rollout

Local runtime

Definition: A runtime that executes models on the user's own machine rather than on a remote service.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Deployment

Related: Model repository, Model versioning, Canary rollout

Distributed inference

Tensor parallelism

Definition: Splitting the weight tensors of individual layers across devices so that each device computes part of every layer.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Distributed inference

Related: Pipeline parallelism, Data parallelism, Expert parallelism
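
Example: the core idea of tensor parallelism can be shown with a column-wise split of one weight matrix, where each simulated device multiplies the same input by its own shard and the outputs are joined afterwards. This is a pure-Python sketch with hypothetical helper names; real systems use accelerator kernels and collective communication in place of the list operations here.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split W column-wise into equal shards, one per simulated device.

    Assumes the column count is divisible by parts.
    """
    width = len(w[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(parts)]


def concat_columns(blocks: List[Matrix]) -> Matrix:
    """Stand-in for an all-gather: join per-device outputs along columns."""
    return [sum((block[r] for block in blocks), []) for r in range(len(blocks[0]))]


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]

shards = split_columns(w, parts=2)
partial_outputs = [matmul(x, shard) for shard in shards]  # one matmul per device
y_parallel = concat_columns(partial_outputs)
y_reference = matmul(x, w)
```

The sharded result matches the single-device product, which is why the split is transparent to the layers that follow.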

Pipeline parallelism

Definition: Splitting a model's layers into consecutive stages placed on different devices, with activations passed from stage to stage.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Distributed inference

Related: Tensor parallelism, Data parallelism, Expert parallelism

Data parallelism

Definition: Replicating the full model on several devices, each of which serves a different subset of requests or batches.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Distributed inference

Related: Tensor parallelism, Pipeline parallelism, Expert parallelism

Expert parallelism

Definition: Placing the experts of a mixture-of-experts model on different devices and routing each token to the devices that hold its selected experts.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Distributed inference

Related: Tensor parallelism, Pipeline parallelism, Data parallelism

Sequence parallelism

Definition: Splitting activations along the sequence dimension so that different devices process different token ranges of the same input.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Distributed inference

Related: Tensor parallelism, Pipeline parallelism, Data parallelism

Disaggregated serving

Definition: Running the prefill and decode phases of LLM inference on separate worker pools so that each can be scaled and tuned independently.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Distributed inference

Related: Tensor parallelism, Pipeline parallelism, Data parallelism

Prefill worker

Definition: A worker that processes the input prompt and produces the KV cache used for decoding.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Distributed inference

Related: Tensor parallelism, Pipeline parallelism, Data parallelism

Decode worker

Definition: A worker that generates output tokens one at a time, reusing the KV cache produced during prefill.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Distributed inference

Related: Tensor parallelism, Pipeline parallelism, Data parallelism

Collective communication

Definition: Communication operations in which a group of devices participates together, such as all-reduce, all-gather, and broadcast.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Distributed inference

Related: Tensor parallelism, Pipeline parallelism, Data parallelism

Interconnect

Definition: The physical links that carry data between devices or nodes, such as PCIe, NVLink, or InfiniBand.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Distributed inference

Related: Tensor parallelism, Pipeline parallelism, Data parallelism

Elasticity

Definition: The ability of a distributed deployment to add or remove workers as load changes while it continues to serve requests.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Distributed inference

Related: Tensor parallelism, Pipeline parallelism, Data parallelism

Rebalancing

Definition: Redistributing requests, model shards, or cached state across workers when load or cluster membership changes.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Distributed inference

Related: Tensor parallelism, Pipeline parallelism, Data parallelism

Edge/mobile/browser

Edge runtime

Definition: A runtime that executes models on devices at or near the data source rather than in a central data center.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Edge/mobile/browser

Related: On-device runtime, TinyML, Mobile delegate

On-device runtime

Definition: A runtime that executes models directly on the end-user device, such as a phone, laptop, or embedded board.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Edge/mobile/browser

Related: Edge runtime, TinyML, Mobile delegate

TinyML

Definition: Machine learning on microcontrollers and other devices with very limited memory and power budgets.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Edge/mobile/browser

Related: Edge runtime, On-device runtime, Mobile delegate

Mobile delegate

Definition: A delegate that offloads supported operations to a mobile accelerator such as a GPU, DSP, or NPU.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Edge/mobile/browser

Related: Edge runtime, On-device runtime, TinyML

Browser runtime

Definition: A runtime that executes models inside a web browser, typically through JavaScript, WebAssembly, WebGPU, or WebNN.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Edge/mobile/browser

Related: Edge runtime, On-device runtime, TinyML

WebAssembly

Definition: A portable binary instruction format that runs in browsers and other hosts, often used as a CPU backend for in-browser inference.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Edge/mobile/browser

Related: Edge runtime, On-device runtime, TinyML

WebGPU

Definition: A web API exposing GPU compute capabilities to browser applications.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Edge/mobile/browser

Related: Edge runtime, On-device runtime, TinyML

WebNN

Definition: A web API for constructing and executing neural network graphs using operating system and hardware capabilities.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Edge/mobile/browser

Related: Edge runtime, On-device runtime, TinyML

Worker

Definition: A background thread on the web platform (a Web Worker) that lets inference run without blocking the page's main thread.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Edge/mobile/browser

Related: Edge runtime, On-device runtime, TinyML

IndexedDB model cache

Definition: A cache of model files stored in the browser's IndexedDB database so that they need not be downloaded again on later visits.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Edge/mobile/browser

Related: Edge runtime, On-device runtime, TinyML

Progressive enhancement

Definition: A design approach in which a baseline experience works everywhere and faster or richer paths are enabled when the device supports them, for example using WebGPU when available and falling back to WebAssembly otherwise.

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Edge/mobile/browser

Related: Edge runtime, On-device runtime, TinyML

NPU delegate

Definition: A delegate that offloads supported operations to a neural processing unit (NPU).

In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Edge/mobile/browser

Related: Edge runtime, On-device runtime, TinyML

Thermal throttling

Definition: A device's reduction of clock speed to limit heat, which lowers sustained inference throughput and raises latency under prolonged load.

A device's reduction of clock speed to limit heat, which lowers sustained inference throughput and raises latency under prolonged load. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Edge/mobile/browser

Related: Edge runtime, On-device runtime, TinyML

Offline inference

Definition: Running a model entirely on the device without network connectivity, using locally stored weights and runtime code.

Running a model entirely on the device without network connectivity, using locally stored weights and runtime code. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Edge/mobile/browser

Related: Edge runtime, On-device runtime, TinyML

Graph optimization

Constant folding

Definition: Evaluating operations whose inputs are all known constants ahead of execution and replacing them with the precomputed result.

Evaluating operations whose inputs are all known constants ahead of execution and replacing them with the precomputed result. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Graph optimization

Related: Dead-code elimination, Common-subexpression elimination, Algebraic simplification
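
As a rough illustration of constant folding, the sketch below folds a toy expression tree. The tuple-based mini-IR and the `fold` helper are assumptions made for this example and do not correspond to any particular compiler's API.

```python
import operator

# Hypothetical mini-IR: a node is a number, a variable name (str),
# or a tuple (op, left, right).
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def fold(node):
    """Recursively replace all-constant subexpressions with their value."""
    if not isinstance(node, tuple):
        return node  # constant or variable leaf: nothing to fold
    op, left, right = node
    left, right = fold(left), fold(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPS[op](left, right)  # both inputs are known ahead of time
    return (op, left, right)

# x + 2 * (3 + 4): only the constant subtree can be folded.
expr = ("add", "x", ("mul", 2, ("add", 3, 4)))
folded = fold(expr)
```

Here the subtree `2 * (3 + 4)` collapses to a single constant, while the addition involving the variable `x` is left for run time.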

Dead-code elimination

Definition: Removing operations whose results never reach a graph output and that have no side effects.

Removing operations whose results never reach a graph output and that have no side effects. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Graph optimization

Related: Constant folding, Common-subexpression elimination, Algebraic simplification

Common-subexpression elimination

Definition: Detecting operations that compute the same value from the same inputs and reusing a single result instead of recomputing it.

Detecting operations that compute the same value from the same inputs and reusing a single result instead of recomputing it. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Graph optimization

Related: Constant folding, Dead-code elimination, Algebraic simplification

Algebraic simplification

Definition: Rewriting expressions with mathematical identities, such as x * 1 = x or x + 0 = x, to remove operations or make them cheaper.

Rewriting expressions with mathematical identities, such as x * 1 = x or x + 0 = x, to remove operations or make them cheaper. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Graph optimization

Related: Constant folding, Dead-code elimination, Common-subexpression elimination

Operator fusion

Definition: Combining adjacent graph operators into one operator so that intermediate tensors do not have to be written to memory and read back.

Combining adjacent graph operators into one operator so that intermediate tensors do not have to be written to memory and read back. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Graph optimization

Related: Constant folding, Dead-code elimination, Common-subexpression elimination

Kernel fusion

Definition: Merging several device kernels into a single kernel to cut launch overhead and memory traffic for intermediate results.

Merging several device kernels into a single kernel to cut launch overhead and memory traffic for intermediate results. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Graph optimization

Related: Constant folding, Dead-code elimination, Common-subexpression elimination

Epilogue fusion

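Definition: Folding elementwise work such as bias addition or activation into the tail of a matrix-multiplication or convolution kernel, so it is applied before the result is written out.

Folding elementwise work such as bias addition or activation into the tail of a matrix-multiplication or convolution kernel, so it is applied before the result is written out. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.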

Where it fits: Graph optimization

Related: Constant folding, Dead-code elimination, Common-subexpression elimination

Loop tiling

Definition: Splitting a loop's iteration space into blocks sized to fit caches or on-chip memory, which improves data reuse.

Splitting a loop's iteration space into blocks sized to fit caches or on-chip memory, which improves data reuse. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Graph optimization

Related: Constant folding, Dead-code elimination, Common-subexpression elimination
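
To make loop tiling concrete, here is a small pure-Python sketch of a tiled matrix multiplication. The function name `matmul_tiled` and the default tile size are illustrative assumptions; real runtimes choose tile sizes per target and generate native code rather than Python loops.

```python
def matmul_tiled(a, b, tile=2):
    """Multiply a (n x k) by b (k x m), iterating block by block."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # Outer loops walk over tiles; inner loops stay inside one tile, so the
    # same small blocks of a, b, and c are reused while they are "hot".
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c

a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
c = matmul_tiled(a, b)
```

The result is the same as an untiled triple loop; only the order in which elements are visited changes.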

Loop unrolling

Definition: Replicating a loop body several times per iteration to reduce loop-control overhead and expose instruction-level parallelism.

Replicating a loop body several times per iteration to reduce loop-control overhead and expose instruction-level parallelism. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Graph optimization

Related: Constant folding, Dead-code elimination, Common-subexpression elimination

Vectorization

Definition: Transforming scalar operations into SIMD instructions that process multiple data elements per instruction.

Transforming scalar operations into SIMD instructions that process multiple data elements per instruction. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Graph optimization

Related: Constant folding, Dead-code elimination, Common-subexpression elimination

Memory layout transformation

Definition: Changing how tensor elements are ordered in memory, for example NCHW versus NHWC, to match the access pattern a kernel or device prefers.

Changing how tensor elements are ordered in memory, for example NCHW versus NHWC, to match the access pattern a kernel or device prefers. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Graph optimization

Related: Constant folding, Dead-code elimination, Common-subexpression elimination

Buffer reuse

Definition: Assigning the same memory to tensors whose lifetimes do not overlap, which reduces peak memory use.

Assigning the same memory to tensors whose lifetimes do not overlap, which reduces peak memory use. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Graph optimization

Related: Constant folding, Dead-code elimination, Common-subexpression elimination
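
One way to picture buffer reuse is a greedy pass over tensor lifetimes. The sketch below is a simplification that assumes every tensor has the same size; the `assign_buffers` helper and the step-based lifetime format are hypothetical, and production memory planners also account for sizes and alignment.

```python
def assign_buffers(lifetimes):
    """Map each tensor to a buffer id, reusing buffers whose tensor is dead.

    lifetimes: {tensor_name: (first_use_step, last_use_step)}
    """
    busy_until = []  # busy_until[b] = last step at which buffer b is in use
    assignment = {}
    for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        for buf, last_step in enumerate(busy_until):
            if last_step < start:  # previous occupant is no longer needed
                busy_until[buf] = end
                assignment[name] = buf
                break
        else:
            busy_until.append(end)  # no free buffer: allocate a new one
            assignment[name] = len(busy_until) - 1
    return assignment

# A chain of four ops: each tensor is produced at one step and read at the next.
lifetimes = {"t0": (0, 1), "t1": (1, 2), "t2": (2, 3), "t3": (3, 4)}
plan = assign_buffers(lifetimes)
```

In this chain, four tensors fit in two buffers because at most two are live at any step.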

Memory planning

Definition: Deciding ahead of execution where and when each tensor is allocated, based on tensor sizes and lifetimes, typically to minimize peak memory and allocation overhead.

Deciding ahead of execution where and when each tensor is allocated, based on tensor sizes and lifetimes, typically to minimize peak memory and allocation overhead. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Graph optimization

Related: Constant folding, Dead-code elimination, Common-subexpression elimination

Autotuning

Definition: Searching empirically over implementation parameters, such as tile sizes or unroll factors, by measuring candidates on the target hardware and keeping the fastest.

Searching empirically over implementation parameters, such as tile sizes or unroll factors, by measuring candidates on the target hardware and keeping the fastest. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Graph optimization

Related: Constant folding, Dead-code elimination, Common-subexpression elimination

Quantization

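Definition: Representing weights, activations, or both with lower-precision numeric types, such as 8-bit integers, to reduce memory footprint and speed up compute at some cost in accuracy.

Representing weights, activations, or both with lower-precision numeric types, such as 8-bit integers, to reduce memory footprint and speed up compute at some cost in accuracy. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.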

Where it fits: Graph optimization

Related: Constant folding, Dead-code elimination, Common-subexpression elimination
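
As a minimal sketch of quantization, the code below applies per-tensor affine (scale and zero-point) quantization to signed 8-bit integers and converts back. The helper names are assumptions for illustration; real toolchains add calibration, per-channel scales, and other refinements.

```python
def quantize(values, num_bits=8):
    """Affine quantization of a list of floats to signed integers."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # The represented range must include 0 so that 0.0 maps to an exact integer.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map integers back to approximate float values."""
    return [(x - zero_point) * scale for x in q]

values = [-1.0, 0.0, 0.5, 2.0]
q, scale, zero_point = quantize(values)
restored = dequantize(q, scale, zero_point)
```

Each restored value differs from the original by at most about half a quantization step (`scale / 2`) as long as it was not clamped.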

Post-training quantization

Definition: Quantizing an already-trained model without retraining, usually with a small calibration dataset to choose scales and zero points.

Quantizing an already-trained model without retraining, usually with a small calibration dataset to choose scales and zero points. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Graph optimization

Related: Constant folding, Dead-code elimination, Common-subexpression elimination

Quantization-aware training

Definition: Training or fine-tuning with simulated quantization in the forward pass so that the model learns weights that tolerate low-precision inference.

Training or fine-tuning with simulated quantization in the forward pass so that the model learns weights that tolerate low-precision inference. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Graph optimization

Related: Constant folding, Dead-code elimination, Common-subexpression elimination

Mixed precision

Definition: Using more than one numeric precision within a model, for example FP16 or BF16 for most operations and FP32 where numerical stability requires it.

Using more than one numeric precision within a model, for example FP16 or BF16 for most operations and FP32 where numerical stability requires it. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Graph optimization

Related: Constant folding, Dead-code elimination, Common-subexpression elimination

Sparsity

Definition: The presence of zero-valued weights or activations that a runtime can skip or store compactly when its kernels and hardware support doing so.

The presence of zero-valued weights or activations that a runtime can skip or store compactly when its kernels and hardware support doing so. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Graph optimization

Related: Constant folding, Dead-code elimination, Common-subexpression elimination

Pruning

Definition: Removing weights, channels, heads, or layers that contribute little to accuracy, producing a smaller or sparser model.

Removing weights, channels, heads, or layers that contribute little to accuracy, producing a smaller or sparser model. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Graph optimization

Related: Constant folding, Dead-code elimination, Common-subexpression elimination

Palettization

Definition: Weight compression that clusters weights into a small lookup table (a palette) and stores a low-bit index per weight.

Weight compression that clusters weights into a small lookup table (a palette) and stores a low-bit index per weight. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Graph optimization

Related: Constant folding, Dead-code elimination, Common-subexpression elimination

Hardware and kernels

CPU

Definition: A general-purpose processor optimized for low-latency, control-heavy work; in AI runtimes it hosts orchestration and serves as a universal fallback for inference.

A general-purpose processor optimized for low-latency, control-heavy work; in AI runtimes it hosts orchestration and serves as a universal fallback for inference. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Hardware and kernels

Related: GPU, TPU, NPU

GPU

Definition: A massively parallel processor with high memory bandwidth, widely used for tensor-heavy training and inference.

A massively parallel processor with high memory bandwidth, widely used for tensor-heavy training and inference. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Hardware and kernels

Related: CPU, TPU, NPU

TPU

Definition: Google's Tensor Processing Unit, an accelerator built around systolic-array matrix units for tensor workloads.

Google's Tensor Processing Unit, an accelerator built around systolic-array matrix units for tensor workloads. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Hardware and kernels

Related: CPU, GPU, NPU

NPU

Definition: A neural processing unit: a power-efficient accelerator for neural network inference, commonly integrated into mobile and PC systems-on-chip.

A neural processing unit: a power-efficient accelerator for neural network inference, commonly integrated into mobile and PC systems-on-chip. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Hardware and kernels

Related: CPU, GPU, TPU

DSP

Definition: A digital signal processor: a processor optimized for streaming numeric work, used on some devices for low-power audio, sensor, and quantized inference workloads.

A digital signal processor: a processor optimized for streaming numeric work, used on some devices for low-power audio, sensor, and quantized inference workloads. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Hardware and kernels

Related: CPU, GPU, TPU

FPGA

Definition: A field-programmable gate array: reconfigurable logic that can implement custom inference datapaths.

A field-programmable gate array: reconfigurable logic that can implement custom inference datapaths. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Hardware and kernels

Related: CPU, GPU, TPU

Custom accelerator

Definition: Purpose-built silicon (an ASIC) designed for specific AI workloads, typically requiring its own compiler and runtime support.

Purpose-built silicon (an ASIC) designed for specific AI workloads, typically requiring its own compiler and runtime support. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Hardware and kernels

Related: CPU, GPU, TPU

SIMD

Definition: Single instruction, multiple data: an execution model in which one instruction operates on several data elements at once.

Single instruction, multiple data: an execution model in which one instruction operates on several data elements at once. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Hardware and kernels

Related: CPU, GPU, TPU

AVX

Definition: Advanced Vector Extensions: the x86 SIMD instruction set extensions (AVX, AVX2, AVX-512) that operate on 256-bit and 512-bit vector registers.

Advanced Vector Extensions: the x86 SIMD instruction set extensions (AVX, AVX2, AVX-512) that operate on 256-bit and 512-bit vector registers. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Hardware and kernels

Related: CPU, GPU, TPU

NEON

Definition: Arm's Advanced SIMD extension, which provides 128-bit vector instructions on Arm CPUs.

Arm's Advanced SIMD extension, which provides 128-bit vector instructions on Arm CPUs. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Hardware and kernels

Related: CPU, GPU, TPU

Tensor core

Definition: A specialized matrix multiply-accumulate unit in NVIDIA GPUs that accelerates reduced-precision and mixed-precision matrix math.

A specialized matrix multiply-accumulate unit in NVIDIA GPUs that accelerates reduced-precision and mixed-precision matrix math. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Hardware and kernels

Related: CPU, GPU, TPU

Shared memory

Definition: In GPU programming, fast on-chip memory shared by the threads of a thread block and managed explicitly by the kernel for data reuse.

In GPU programming, fast on-chip memory shared by the threads of a thread block and managed explicitly by the kernel for data reuse. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Hardware and kernels

Related: CPU, GPU, TPU

HBM

Definition: High Bandwidth Memory: stacked DRAM placed close to the processor to provide very high memory bandwidth.

High Bandwidth Memory: stacked DRAM placed close to the processor to provide very high memory bandwidth. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Hardware and kernels

Related: CPU, GPU, TPU

Memory bandwidth

Definition: The rate at which data moves between memory and compute units; it is often the limiting factor for LLM decode.

The rate at which data moves between memory and compute units; it is often the limiting factor for LLM decode. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Hardware and kernels

Related: CPU, GPU, TPU

Arithmetic intensity

Definition: The ratio of arithmetic operations to bytes moved to and from memory; in the roofline model it determines whether a kernel is compute-bound or memory-bound.

The ratio of arithmetic operations to bytes moved to and from memory; in the roofline model it determines whether a kernel is compute-bound or memory-bound. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Hardware and kernels

Related: CPU, GPU, TPU
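
Arithmetic intensity is easiest to see with numbers. The sketch below estimates it for a matrix multiplication and applies the roofline bound; the function names, the 2-byte element size, and the idealized assumption that each matrix is moved exactly once are all illustrative choices, and the hardware figures in the usage lines are made up.

```python
def matmul_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul, ideal data movement."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A, B; write C
    return flops / bytes_moved

def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_flops, intensity * peak_bandwidth)

# Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK_FLOPS, PEAK_BW = 100e12, 1e12

decode_like = matmul_intensity(1, 4096, 4096)      # one token against a weight matrix
prefill_like = matmul_intensity(4096, 4096, 4096)  # many tokens at once
```

With these assumed numbers, the single-row case sits far below the ridge point of 100 FLOPs per byte and is memory-bound, while the large square case is well above it and is compute-bound.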

Collective operation

Definition: A communication pattern in which a group of devices participates together, such as all-reduce, all-gather, or broadcast.

A communication pattern in which a group of devices participates together, such as all-reduce, all-gather, or broadcast. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Hardware and kernels

Related: CPU, GPU, TPU

Kernel launch overhead

Definition: The fixed host-side and driver cost of dispatching a kernel to a device, which can dominate run time when kernels are many and small.

The fixed host-side and driver cost of dispatching a kernel to a device, which can dominate run time when kernels are many and small. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Hardware and kernels

Related: CPU, GPU, TPU

CUDA graph

Definition: A CUDA feature that captures a sequence of kernels and other operations as a graph that can be launched repeatedly with reduced CPU overhead.

A CUDA feature that captures a sequence of kernels and other operations as a graph that can be launched repeatedly with reduced CPU overhead. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Hardware and kernels

Related: CPU, GPU, TPU

LLM inference

Transformer inference

Definition: Running the forward pass of a transformer model to produce outputs; for LLMs it splits into a prefill phase and a decode phase.

Running the forward pass of a transformer model to produce outputs; for LLMs it splits into a prefill phase and a decode phase. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: LLM inference

Related: Prefill, Decode, Autoregressive generation

Prefill

Definition: The phase that processes all prompt tokens in parallel, populates the KV cache, and produces the first output token; it is typically compute-bound.

The phase that processes all prompt tokens in parallel, populates the KV cache, and produces the first output token; it is typically compute-bound. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: LLM inference

Related: Transformer inference, Decode, Autoregressive generation

Decode

Definition: The phase that generates output tokens one step at a time, reading the KV cache at each step; it is typically memory-bandwidth-bound.

The phase that generates output tokens one step at a time, reading the KV cache at each step; it is typically memory-bandwidth-bound. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: LLM inference

Related: Transformer inference, Prefill, Autoregressive generation

Autoregressive generation

Definition: Producing a sequence one token at a time, with each token conditioned on all previously generated tokens.

Producing a sequence one token at a time, with each token conditioned on all previously generated tokens. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: LLM inference

Related: Transformer inference, Prefill, Decode

Attention kernel

Definition: A device kernel that computes attention, softmax(QK^T / sqrt(d)) V, often fused to avoid materializing the full score matrix, as in FlashAttention.

A device kernel that computes attention, softmax(QK^T / sqrt(d)) V, often fused to avoid materializing the full score matrix, as in FlashAttention. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: LLM inference

Related: Transformer inference, Prefill, Decode

KV cache

Definition: Stored key and value tensors used to avoid recomputing attention over previous tokens.

Stored key and value tensors used to avoid recomputing attention over previous tokens. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: LLM inference

Related: Transformer inference, Prefill, Decode
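
Because the KV cache stores one key tensor and one value tensor per layer for every cached token, its size can be estimated with simple arithmetic. The sketch below assumes a standard transformer layout; the function name and the model dimensions in the usage lines are hypothetical, not taken from any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV-cache size; the leading 2 covers keys plus values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 16-bit cache entries.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
full_context = kv_cache_bytes(32, 8, 128, seq_len=4096)
```

Under these assumptions each token costs 128 KiB and a 4,096-token sequence costs 512 MiB, which is why cache size grows linearly with both context length and batch size.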

Context length

Definition: The maximum number of tokens, prompt plus generated output, that a model or runtime can attend to in one sequence.

The maximum number of tokens, prompt plus generated output, that a model or runtime can attend to in one sequence. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: LLM inference

Related: Transformer inference, Prefill, Decode

Streaming generation

Definition: Returning tokens to the client as they are produced rather than after the full response completes.

Returning tokens to the client as they are produced rather than after the full response completes. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: LLM inference

Related: Transformer inference, Prefill, Decode

TTFT

Definition: Time to first token, a key interactive latency metric for streaming LLM responses.

Time to first token, a key interactive latency metric for streaming LLM responses. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: LLM inference

Related: Transformer inference, Prefill, Decode

TPOT

Definition: Time per output token, a decode-phase latency metric.

Time per output token, a decode-phase latency metric. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: LLM inference

Related: Transformer inference, Prefill, Decode
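
TTFT and TPOT can both be derived from per-token arrival timestamps. The sketch below uses one common convention, in which TPOT averages the gaps after the first token; other definitions exist, and the `latency_metrics` helper is an assumption for illustration.

```python
def latency_metrics(request_time, token_times):
    """Return (ttft, tpot) in seconds from output-token arrival timestamps."""
    ttft = token_times[0] - request_time
    if len(token_times) < 2:
        return ttft, None  # TPOT is undefined for a single-token response
    # Average inter-token gap over the decode phase, excluding the first token.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot

# Request sent at t = 0.0 s; four tokens arrive at the times below.
ttft, tpot = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
```

In this example the first token arrives after 0.5 s and each later token takes 0.25 s on average.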

Structured generation

Definition: Producing output that conforms to a required format, such as a JSON schema or a grammar.

Producing output that conforms to a required format, such as a JSON schema or a grammar. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: LLM inference

Related: Transformer inference, Prefill, Decode

Constrained decoding

Definition: Masking out, at each decode step, the tokens that would violate a grammar, schema, or regular expression, so that only valid continuations can be selected.

Masking out, at each decode step, the tokens that would violate a grammar, schema, or regular expression, so that only valid continuations can be selected. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: LLM inference

Related: Transformer inference, Prefill, Decode
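
A single step of constrained decoding can be sketched as a logit mask followed by token selection. The example below uses greedy selection and a hand-written set of allowed token ids; in a real system that set would come from a grammar or schema engine, which is not shown, and the `constrained_argmax` name is hypothetical.

```python
import math

def constrained_argmax(logits, allowed_token_ids):
    """Pick the highest-scoring token among those the constraint allows."""
    masked = [
        score if tok in allowed_token_ids else -math.inf
        for tok, score in enumerate(logits)
    ]
    return max(range(len(masked)), key=masked.__getitem__)

# Toy vocabulary ids: 0 = "{", 1 = "}", 2 = "hello", 3 = '"', 4 = "1"
logits = [0.1, 0.3, 2.5, 0.2, 0.0]
unconstrained = constrained_argmax(logits, {0, 1, 2, 3, 4})
start_of_object = constrained_argmax(logits, {0})  # JSON object must open with "{"
```

Without the mask the model would pick "hello"; with it, only the opening brace can be chosen, regardless of its raw score.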

Speculative decoding

Definition: Using a faster draft path and target verification to reduce sequential generation steps.

Using a faster draft path and target verification to reduce sequential generation steps. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: LLM inference

Related: Transformer inference, Prefill, Decode
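
The draft-then-verify loop can be sketched for the greedy case, where a drafted token is accepted only if it equals the token the target model would have produced. The token lists below stand in for real model calls, and `verify_draft` is a hypothetical helper; sampling-based variants use a probabilistic accept/reject rule instead of exact matching.

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy verification of one speculative step.

    target_tokens[i] is the target model's choice at position i given the
    draft prefix before it, so it has one more entry than draft_tokens.
    Returns (tokens to emit, fraction of drafted tokens accepted).
    """
    accepted = 0
    for drafted, expected in zip(draft_tokens, target_tokens):
        if drafted != expected:
            break
        accepted += 1
    # The target's own token at the first mismatch (or the extra final
    # position) is always valid, so every step emits at least one token.
    output = draft_tokens[:accepted] + [target_tokens[accepted]]
    return output, accepted / len(draft_tokens)

output, acceptance_rate = verify_draft([5, 7, 9, 2], [5, 7, 3, 8, 1])
```

Here two of four drafted tokens are accepted, so one target pass yields three output tokens instead of one.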

Draft model

Definition: The smaller, faster model in speculative decoding that proposes candidate tokens for the target model to verify.

The smaller, faster model in speculative decoding that proposes candidate tokens for the target model to verify. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: LLM inference

Related: Transformer inference, Prefill, Decode

Acceptance rate

Definition: In speculative decoding, the fraction of drafted tokens that the target model accepts; higher rates yield larger speedups.

In speculative decoding, the fraction of drafted tokens that the target model accepts; higher rates yield larger speedups. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: LLM inference

Related: Transformer inference, Prefill, Decode

Medusa

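Definition: A speculative decoding method that adds extra decoding heads to the base model to predict several future tokens in parallel, with candidates verified using tree attention.

A speculative decoding method that adds extra decoding heads to the base model to predict several future tokens in parallel, with candidates verified using tree attention. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.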

Where it fits: LLM inference

Related: Transformer inference, Prefill, Decode

EAGLE

Definition: A speculative decoding method that drafts at the feature (hidden-state) level, using a lightweight autoregressive head on top of the target model's features.

A speculative decoding method that drafts at the feature (hidden-state) level, using a lightweight autoregressive head on top of the target model's features. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: LLM inference

Related: Transformer inference, Prefill, Decode

Long-context runtime

Definition: A runtime designed or configured to serve very long sequences, where KV-cache memory and attention cost dominate resource use.

A runtime designed or configured to serve very long sequences, where KV-cache memory and attention cost dominate resource use. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: LLM inference

Related: Transformer inference, Prefill, Decode

Memory and KV cache

PagedAttention

Definition: A KV-cache paging technique that stores sequence blocks non-contiguously while preserving attention semantics.

A KV-cache paging technique that stores sequence blocks non-contiguously while preserving attention semantics. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Memory and KV cache

Related: Prefix caching, RadixAttention, Cache eviction

Prefix caching

Definition: Reusing the KV cache computed for a shared prompt prefix across requests so that the prefix is not prefilled again.

Reusing the KV cache computed for a shared prompt prefix across requests so that the prefix is not prefilled again. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Memory and KV cache

Related: PagedAttention, RadixAttention, Cache eviction
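
A simple way to sketch prefix caching is to key cached KV blocks by the full token prefix they cover and look up how much of a new prompt is already present. The set-based cache, the block size, and both helper names below are assumptions for illustration; real engines typically store hashes and the KV tensors themselves.

```python
def insert_prefix(prompt_tokens, cache, block_size=4):
    """Record every full-block prefix of a prompt as cached."""
    for end in range(block_size, len(prompt_tokens) + 1, block_size):
        cache.add(tuple(prompt_tokens[:end]))

def cached_prefix_len(prompt_tokens, cache, block_size=4):
    """Count leading tokens whose blocks are already cached."""
    matched = 0
    for end in range(block_size, len(prompt_tokens) + 1, block_size):
        # The key covers the whole prefix, so a block only matches when
        # everything before it matches too.
        if tuple(prompt_tokens[:end]) not in cache:
            break
        matched = end
    return matched

cache = set()
insert_prefix(list(range(10)), cache)  # e.g. a shared system prompt
same_prefix = cached_prefix_len(list(range(8)) + [99, 100, 101, 102], cache)
diverges_early = cached_prefix_len([0, 1, 2, 3, 50, 51, 52, 53], cache)
```

Only the tokens past the matched length need to be prefilled for the new request.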

RadixAttention

Definition: An SGLang technique that organizes reusable prefixes in a radix tree for efficient KV reuse.

An SGLang technique that organizes reusable prefixes in a radix tree for efficient KV reuse. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Memory and KV cache

Related: PagedAttention, Prefix caching, Cache eviction

Cache eviction

Definition: Removing KV-cache entries under a policy such as least-recently-used when memory is needed for other sequences.

Removing KV-cache entries under a policy such as least-recently-used when memory is needed for other sequences. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Memory and KV cache

Related: PagedAttention, Prefix caching, RadixAttention
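
Least-recently-used (LRU) is one common eviction policy, sketched below with the standard library's `OrderedDict`. The `LRUCache` class and its string values are placeholders; an actual KV cache would evict blocks of tensors and may use other policies.

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # oldest entry first

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        """Insert an entry and return the keys evicted to make room."""
        self.entries[key] = value
        self.entries.move_to_end(key)
        evicted = []
        while len(self.entries) > self.capacity:
            old_key, _ = self.entries.popitem(last=False)
            evicted.append(old_key)
        return evicted

cache = LRUCache(capacity=2)
cache.put("seq-a", "kv blocks for a")
cache.put("seq-b", "kv blocks for b")
cache.get("seq-a")                      # seq-a is now the most recent
evicted = cache.put("seq-c", "kv blocks for c")
```

Because `seq-a` was touched most recently, inserting `seq-c` evicts `seq-b`.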

KV offload

Definition: Moving KV-cache blocks from accelerator memory to host memory or storage to free capacity, and fetching them back when they are needed.

Moving KV-cache blocks from accelerator memory to host memory or storage to free capacity, and fetching them back when they are needed. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Memory and KV cache

Related: PagedAttention, Prefix caching, RadixAttention

KV transfer

Definition: Moving KV-cache data between workers or nodes, for example from a prefill instance to a decode instance in disaggregated serving.

Moving KV-cache data between workers or nodes, for example from a prefill instance to a decode instance in disaggregated serving. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Memory and KV cache

Related: PagedAttention, Prefix caching, RadixAttention

Block table

Definition: In paged KV caching, the per-sequence mapping from logical KV blocks to physical blocks in memory.

In paged KV caching, the per-sequence mapping from logical KV blocks to physical blocks in memory. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Memory and KV cache

Related: PagedAttention, Prefix caching, RadixAttention
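
A block table can be sketched as a list that maps a sequence's logical block index to a physical block id. The `BlockTable` class, the block size of 4, and the free-list contents below are hypothetical; the point is only that consecutive tokens need not live in consecutive physical blocks.

```python
class BlockTable:
    """Per-sequence map from logical KV blocks to physical block ids."""

    def __init__(self, block_size, free_blocks):
        self.block_size = block_size
        self.free_blocks = list(free_blocks)  # shared pool in a real system
        self.blocks = []                      # logical index -> physical id
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is taken only when the current one is full.
        if self.num_tokens % self.block_size == 0:
            self.blocks.append(self.free_blocks.pop())
        self.num_tokens += 1

    def locate(self, token_index):
        """Return (physical_block, offset_in_block) for a token position."""
        return (self.blocks[token_index // self.block_size],
                token_index % self.block_size)

    def unused_slots(self):
        """Slots reserved in the last block but not yet filled."""
        return len(self.blocks) * self.block_size - self.num_tokens

table = BlockTable(block_size=4, free_blocks=[7, 3, 9])
for _ in range(6):
    table.append_token()
```

After six tokens the sequence occupies two non-adjacent physical blocks, and the two empty slots in the second block are the only memory wasted to internal fragmentation.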

Internal fragmentation

Definition: Memory wasted inside an allocated unit because the unit is larger than the data it holds, such as the unused slots in a sequence's last KV block.

Memory wasted inside an allocated unit because the unit is larger than the data it holds, such as the unused slots in a sequence's last KV block. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Memory and KV cache

Related: PagedAttention, Prefix caching, RadixAttention

External fragmentation

Definition: Free memory split into non-contiguous pieces that are each too small to satisfy a request for a large contiguous allocation.

Free memory split into non-contiguous pieces that are each too small to satisfy a request for a large contiguous allocation. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Memory and KV cache

Related: PagedAttention, Prefix caching, RadixAttention

MQA

Definition: Multi-query attention: an attention variant in which all query heads share a single key/value head, shrinking the KV cache.

Multi-query attention: an attention variant in which all query heads share a single key/value head, shrinking the KV cache. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Memory and KV cache

Related: PagedAttention, Prefix caching, RadixAttention

GQA

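Definition: Grouped-query attention: an attention variant in which query heads are divided into groups that each share one key/value head, a middle ground between multi-head and multi-query attention.

Grouped-query attention: an attention variant in which query heads are divided into groups that each share one key/value head, a middle ground between multi-head and multi-query attention. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.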

Where it fits: Memory and KV cache

Related: PagedAttention, Prefix caching, RadixAttention
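
The relationship between MQA, GQA, and ordinary multi-head attention comes down to how query heads map onto key/value heads. The sketch below shows that mapping with a hypothetical helper, assuming contiguous groups of equal size.

```python
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Return the index of the KV head that a given query head reads."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

gqa = [kv_head_for_query_head(h, 8, 2) for h in range(8)]  # 2 KV heads
mqa = [kv_head_for_query_head(h, 8, 1) for h in range(8)]  # 1 KV head
mha = [kv_head_for_query_head(h, 8, 8) for h in range(8)]  # 1 KV head per query head
```

With 8 query heads, going from 8 KV heads to 2 shrinks the KV cache by a factor of 4, and going to 1 shrinks it by a factor of 8.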

Cache-aware routing

Definition: Directing a request to the replica most likely to already hold the KV cache for its prefix.

Directing a request to the replica most likely to already hold the KV cache for its prefix. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Memory and KV cache

Related: PagedAttention, Prefix caching, RadixAttention

Prefix reuse

Definition: Serving a request from KV entries already computed for a matching prompt prefix; the effect that prefix caching makes possible.

Serving a request from KV entries already computed for a matching prompt prefix; the effect that prefix caching makes possible. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Memory and KV cache

Related: PagedAttention, Prefix caching, RadixAttention

Memory pressure

Definition: The condition in which the KV cache and other allocations approach device memory capacity, forcing eviction, preemption, offload, or request queuing.

The condition in which the KV cache and other allocations approach device memory capacity, forcing eviction, preemption, offload, or request queuing. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Memory and KV cache

Related: PagedAttention, Prefix caching, RadixAttention

Observability and benchmarking

Runtime trace

Definition: A structured record of runtime spans, events, decisions, timing, costs, and references.

A structured record of runtime spans, events, decisions, timing, costs, and references. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Observability and benchmarking

Related: Span, Replay, Trace waterfall

Span

Definition: A named, timed unit of work within a trace, with a start time, a duration, attributes, and a reference to its parent.

A named, timed unit of work within a trace, with a start time, a duration, attributes, and a reference to its parent. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Observability and benchmarking

Related: Runtime trace, Replay, Trace waterfall

Replay

Definition: Re-executing a recorded run from its trace and captured inputs to reproduce, debug, or evaluate runtime behavior.

Re-executing a recorded run from its trace and captured inputs to reproduce, debug, or evaluate runtime behavior. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Observability and benchmarking

Related: Runtime trace, Span, Trace waterfall

Trace waterfall

Definition: A timeline view of a trace that draws each span as a bar positioned by start time and duration, exposing nesting, overlap, and gaps.

A timeline view of a trace that draws each span as a bar positioned by start time and duration, exposing nesting, overlap, and gaps. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Observability and benchmarking

Related: Runtime trace, Span, Replay

Queue time

Definition: The time a request spends waiting to be scheduled before the runtime starts executing it.

The time a request spends waiting to be scheduled before the runtime starts executing it. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Observability and benchmarking

Related: Runtime trace, Span, Replay

Model time

Definition: The portion of request latency spent executing the model, such as prefill and decode for an LLM.

The portion of request latency spent executing the model, such as prefill and decode for an LLM. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Observability and benchmarking

Related: Runtime trace, Span, Replay

Tool time

Definition: The portion of request latency spent waiting on tool calls, including external APIs and code execution.

The portion of request latency spent waiting on tool calls, including external APIs and code execution. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Observability and benchmarking

Related: Runtime trace, Span, Replay

Retrieval time

Definition: The portion of request latency spent fetching context, such as embedding lookups, vector search, and document loading.

The portion of request latency spent fetching context, such as embedding lookups, vector search, and document loading. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Observability and benchmarking

Related: Runtime trace, Span, Replay

Token count

Definition: The number of input and output tokens a request consumes, used for capacity planning, cost accounting, and limits.

The number of input and output tokens a request consumes, used for capacity planning, cost accounting, and limits. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Observability and benchmarking

Related: Runtime trace, Span, Replay

Tail latency

Definition: Latency at high percentiles of the distribution, such as p95 or p99, describing the slowest requests rather than the typical one.

Latency at high percentiles of the distribution, such as p95 or p99, describing the slowest requests rather than the typical one. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Observability and benchmarking

Related: Runtime trace, Span, Replay
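
Tail latency is usually reported as a high percentile such as p95 or p99. The sketch below computes a percentile with the nearest-rank method, which is one of several common conventions; the function name `percentile` and the sample values are made up for illustration.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical request latencies in milliseconds, with one slow outlier.
    latencies_ms = [120, 95, 110, 105, 2400, 100, 98, 102, 115, 108]
    print(percentile(latencies_ms, 50))  # 105
    print(percentile(latencies_ms, 99))  # 2400
```

The median hides the single 2400 ms request, while p99 surfaces it, which is why tail percentiles are tracked separately from averages.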

Throughput

Definition: The amount of work a runtime completes per unit of time, such as requests per second or tokens per second.

The amount of work a runtime completes per unit of time, such as requests per second or tokens per second. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Observability and benchmarking

Related: Runtime trace, Span, Replay

Cost per successful task

Definition: Total spend on a workload divided by the number of tasks that completed successfully, so that failed attempts and retries count toward cost.

Total spend on a workload divided by the number of tasks that completed successfully, so that failed attempts and retries count toward cost. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Observability and benchmarking

Related: Runtime trace, Span, Replay
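
One common reading of cost per successful task is total spend, including failed attempts, divided by the number of successes. The sketch below works through that arithmetic; the `TaskRun` type and its fields are hypothetical, and the choice to return infinity when nothing succeeds is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TaskRun:
    cost_usd: float   # everything spent on this attempt (tokens, tools, compute)
    succeeded: bool


def cost_per_successful_task(runs: Iterable[TaskRun]) -> float:
    """Total cost of all runs divided by the number of successful runs."""
    runs = list(runs)
    total_cost = sum(run.cost_usd for run in runs)
    successes = sum(1 for run in runs if run.succeeded)
    if successes == 0:
        # Assumption: no successes means the cost per success is unbounded.
        return float("inf")
    return total_cost / successes


if __name__ == "__main__":
    runs = [TaskRun(0.25, True), TaskRun(0.5, False), TaskRun(0.25, True)]
    print(cost_per_successful_task(runs))  # 0.5
```

In the example, the failed run still costs 0.50, so two successes at 0.25 each end up costing 0.50 per successful task.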

Benchmark fixture

Definition: A fixed, versioned set of inputs, configuration, and environment details that makes benchmark runs repeatable and comparable.

A fixed, versioned set of inputs, configuration, and environment details that makes benchmark runs repeatable and comparable. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Observability and benchmarking

Related: Runtime trace, Span, Replay

Warmup window

Definition: An initial benchmark period whose samples are discarded so that caches, compilation, and memory allocation can reach a steady state.

An initial benchmark period whose samples are discarded so that caches, compilation, and memory allocation can reach a steady state. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Observability and benchmarking

Related: Runtime trace, Span, Replay

Measurement window

Definition: The benchmark period, after warmup, during which samples are recorded and reported.

The benchmark period, after warmup, during which samples are recorded and reported. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Observability and benchmarking

Related: Runtime trace, Span, Replay
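
Benchmarks commonly discard samples taken during a warmup window and report only those from the measurement window that follows. A minimal sketch of that filtering is below; the function name `measured_latencies` and the `(timestamp, latency)` sample layout are assumptions chosen for this example.

```python
from typing import Iterable, List, Tuple


def measured_latencies(
    samples: Iterable[Tuple[float, float]], warmup_s: float, measure_s: float
) -> List[float]:
    """Keep latencies whose timestamp falls inside the measurement window.

    Each sample is (seconds since benchmark start, latency in ms).
    The window is [warmup_s, warmup_s + measure_s).
    """
    if warmup_s < 0 or measure_s <= 0:
        raise ValueError("warmup_s must be >= 0 and measure_s must be > 0")
    window_start = warmup_s
    window_end = warmup_s + measure_s
    return [latency for ts, latency in samples if window_start <= ts < window_end]


if __name__ == "__main__":
    samples = [(0.5, 900.0), (1.5, 300.0), (2.0, 110.0), (3.0, 105.0), (12.5, 100.0)]
    # A 2 s warmup followed by a 10 s measurement window.
    print(measured_latencies(samples, warmup_s=2.0, measure_s=10.0))  # [110.0, 105.0]
```

The two slow early samples are excluded as warmup, and the sample at 12.5 s falls after the window closes.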

Power per inference

Definition: The energy consumed to serve one inference, typically estimated from average power draw multiplied by inference time.

The energy consumed to serve one inference, typically estimated from average power draw multiplied by inference time. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Observability and benchmarking

Related: Runtime trace, Span, Replay

Quality regression

Definition: A measurable drop in output quality against a baseline after a change to the model, quantization, prompt, or runtime configuration.

A measurable drop in output quality against a baseline after a change to the model, quantization, prompt, or runtime configuration. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Observability and benchmarking

Related: Runtime trace, Span, Replay

Runtime fundamentals

AI runtime

Definition: The execution environment that turns model artifacts and requests into reliable AI-enabled behavior.

The execution environment that turns model artifacts and requests into reliable AI-enabled behavior. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Runtime fundamentals

Related: Inference runtime, Inference engine, Model server

Inference runtime

Definition: The runtime component that executes a prepared model on target hardware, managing memory, kernels, and device placement.

The runtime component that executes a prepared model on target hardware, managing memory, kernels, and device placement. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Runtime fundamentals

Related: AI runtime, Inference engine, Model server

Inference engine

Definition: A system optimized to load trained models and run inference on one or more hardware targets.

A system optimized to load trained models and run inference on one or more hardware targets. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Runtime fundamentals

Related: AI runtime, Inference runtime, Model server

Model server

Definition: A service layer that exposes models through APIs and manages scheduling, repositories, health, and batching.

A service layer that exposes models through APIs and manages scheduling, repositories, health, and batching. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Runtime fundamentals

Related: AI runtime, Inference runtime, Inference engine

Serving platform

Definition: The infrastructure around one or more model servers that handles deployment, routing, autoscaling, and multi-model operations.

The infrastructure around one or more model servers that handles deployment, routing, autoscaling, and multi-model operations. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Runtime fundamentals

Related: AI runtime, Inference runtime, Inference engine

Compiler runtime

Definition: The runtime component that loads and executes compiler-generated artifacts, handling memory allocation, kernel launch, and device interaction.

The runtime component that loads and executes compiler-generated artifacts, handling memory allocation, kernel launch, and device interaction. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Runtime fundamentals

Related: AI runtime, Inference runtime, Inference engine

Graph runtime

Definition: A runtime that executes a model represented as a graph of operators, ordering the nodes and assigning them to available execution backends.

A runtime that executes a model represented as a graph of operators, ordering the nodes and assigning them to available execution backends. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Runtime fundamentals

Related: AI runtime, Inference runtime, Inference engine

Framework runtime

Definition: The execution machinery built into a machine learning framework, such as eager or graph execution, when it is used directly to run inference.

The execution machinery built into a machine learning framework, such as eager or graph execution, when it is used directly to run inference. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Runtime fundamentals

Related: AI runtime, Inference runtime, Inference engine

Runtime infrastructure

Definition: The compute, networking, storage, and orchestration resources on which AI runtimes are deployed and operated.

The compute, networking, storage, and orchestration resources on which AI runtimes are deployed and operated. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Runtime fundamentals

Related: AI runtime, Inference runtime, Inference engine

Adapter

Definition: A runtime fundamentals concept used when designing, implementing, or operating AI runtimes.

A runtime fundamentals concept used when designing, implementing, or operating AI runtimes. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Runtime fundamentals

Related: AI runtime, Inference runtime, Inference engine

Execution boundary

Definition: The line between components that run in different processes, devices, or trust domains, across which data and control must be passed explicitly.

The line between components that run in different processes, devices, or trust domains, across which data and control must be passed explicitly. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Runtime fundamentals

Related: AI runtime, Inference runtime, Inference engine

Model lifecycle

Definition: The stages a model artifact passes through in a runtime: packaging, loading, warmup, serving, updating, and unloading.

The stages a model artifact passes through in a runtime: packaging, loading, warmup, serving, updating, and unloading. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Runtime fundamentals

Related: AI runtime, Inference runtime, Inference engine

Request lifecycle

Definition: The stages a request passes through in a runtime: admission, queuing, scheduling, execution, response streaming, and completion or cancellation.

The stages a request passes through in a runtime: admission, queuing, scheduling, execution, response streaming, and completion or cancellation. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Runtime fundamentals

Related: AI runtime, Inference runtime, Inference engine

Scheduling and batching

Static batching

Definition: Grouping a fixed set of requests into a batch before execution and running it to completion before any new request can join.

Grouping a fixed set of requests into a batch before execution and running it to completion before any new request can join. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Scheduling and batching

Related: Dynamic batching, Continuous batching, In-flight batching

Dynamic batching

Definition: Forming batches at serving time from requests that arrive within a short window, bounded by a maximum batch size and a maximum queue delay.

Forming batches at serving time from requests that arrive within a short window, bounded by a maximum batch size and a maximum queue delay. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Scheduling and batching

Related: Static batching, Continuous batching, In-flight batching

Continuous batching

Definition: Allowing new requests to enter an active LLM decoding batch as others finish.

Allowing new requests to enter an active LLM decoding batch as others finish. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Scheduling and batching

Related: Static batching, Dynamic batching, In-flight batching
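
The toy simulation below illustrates the definition: at every decode step, finished requests leave the batch and waiting requests take the freed slots. It assumes all requests are already queued, ignores prefill, and treats each step as producing one token per active request; the function name and parameters are hypothetical and do not mirror any real scheduler.

```python
from collections import deque
from typing import Deque, Dict, List


def simulate_continuous_batching(
    decode_lengths: Dict[str, int], max_batch_size: int
) -> List[List[str]]:
    """Return the request IDs active at each decode step.

    decode_lengths maps request ID to the number of tokens it must generate
    (assumed to be at least 1). Requests are admitted in dict order.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    waiting: Deque[str] = deque(decode_lengths)
    remaining = dict(decode_lengths)
    active: List[str] = []
    steps: List[List[str]] = []
    while waiting or active:
        # Fill free slots with waiting requests before each step.
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())
        steps.append(list(active))
        # One decode step: every active request generates one token.
        for request_id in active:
            remaining[request_id] -= 1
        # Finished requests leave immediately, freeing their slots.
        active = [rid for rid in active if remaining[rid] > 0]
    return steps


if __name__ == "__main__":
    for step in simulate_continuous_batching({"a": 3, "b": 1, "c": 2}, max_batch_size=2):
        print(step)
```

In this example request "c" joins as soon as "b" finishes, so all work completes in 3 steps. Running "a" and "b" as a fixed batch to completion and then "c" alone would take 5.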

In-flight batching

Definition: Another name for continuous batching, used notably by NVIDIA TensorRT-LLM: requests join and leave the running batch between generation steps.

Another name for continuous batching, used notably by NVIDIA TensorRT-LLM: requests join and leave the running batch between generation steps. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Scheduling and batching

Related: Static batching, Dynamic batching, Continuous batching

Chunked prefill

Definition: Splitting the prefill of a long prompt into smaller chunks that can be scheduled alongside decode steps, limiting the stall a long prompt imposes on other requests.

Splitting the prefill of a long prompt into smaller chunks that can be scheduled alongside decode steps, limiting the stall a long prompt imposes on other requests. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Scheduling and batching

Related: Static batching, Dynamic batching, Continuous batching

Admission control

Definition: Deciding whether to accept, delay, or reject an incoming request based on current load and available capacity.

Deciding whether to accept, delay, or reject an incoming request based on current load and available capacity. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Scheduling and batching

Related: Static batching, Dynamic batching, Continuous batching

SLO-aware scheduling

Definition: Ordering and batching work with reference to latency targets, such as time to first token or per-token latency, rather than arrival order alone.

Ordering and batching work with reference to latency targets, such as time to first token or per-token latency, rather than arrival order alone. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Scheduling and batching

Related: Static batching, Dynamic batching, Continuous batching

Priority queue

Definition: A queue that serves requests in order of assigned priority rather than strict arrival order.

A queue that serves requests in order of assigned priority rather than strict arrival order. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Scheduling and batching

Related: Static batching, Dynamic batching, Continuous batching

Fairness

Definition: Sharing runtime capacity among tenants or request classes so that none of them starves the others.

Sharing runtime capacity among tenants or request classes so that none of them starves the others. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Scheduling and batching

Related: Static batching, Dynamic batching, Continuous batching

Backpressure

Definition: Signaling upstream callers to slow down or stop sending work when the runtime cannot keep up, instead of queuing without bound.

Signaling upstream callers to slow down or stop sending work when the runtime cannot keep up, instead of queuing without bound. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Scheduling and batching

Related: Static batching, Dynamic batching, Continuous batching
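
A simple way to picture backpressure is a bounded queue that refuses new work when it is full, so the caller learns it must slow down or retry later. The sketch below uses Python's standard `queue.Queue`; the `BoundedRequestQueue` class and its method names are hypothetical.

```python
import queue
from typing import Optional


class BoundedRequestQueue:
    """Accepts requests until max_pending is reached, then rejects new ones."""

    def __init__(self, max_pending: int) -> None:
        if max_pending < 1:
            # queue.Queue treats maxsize <= 0 as unbounded, which would
            # defeat the purpose of this example.
            raise ValueError("max_pending must be at least 1")
        self._pending: "queue.Queue[str]" = queue.Queue(maxsize=max_pending)

    def submit(self, request_id: str) -> bool:
        """Return True if accepted, False if the caller should back off."""
        try:
            self._pending.put_nowait(request_id)
        except queue.Full:
            return False
        return True

    def take(self) -> Optional[str]:
        """Remove and return the oldest pending request, or None if empty."""
        try:
            return self._pending.get_nowait()
        except queue.Empty:
            return None


if __name__ == "__main__":
    pending = BoundedRequestQueue(max_pending=2)
    print(pending.submit("r1"), pending.submit("r2"), pending.submit("r3"))
    # True True False
```

A server built this way would typically translate a `False` result into a retryable "busy" response rather than letting latency grow with an unbounded queue.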

Cancellation

Definition: Stopping an in-progress request and releasing its batch slot, KV cache, and other resources when the client disconnects or the work is no longer needed.

Stopping an in-progress request and releasing its batch slot, KV cache, and other resources when the client disconnects or the work is no longer needed. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Scheduling and batching

Related: Static batching, Dynamic batching, Continuous batching

Head-of-line blocking

Definition: A delay in which a slow or long request at the front of a queue or batch holds up the requests behind it.

A delay in which a slow or long request at the front of a queue or batch holds up the requests behind it. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Scheduling and batching

Related: Static batching, Dynamic batching, Continuous batching

Multi-tenant quota

Definition: A per-tenant limit on requests, tokens, concurrency, or compute that keeps shared capacity from being exhausted by one tenant.

A per-tenant limit on requests, tokens, concurrency, or compute that keeps shared capacity from being exhausted by one tenant. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Scheduling and batching

Related: Static batching, Dynamic batching, Continuous batching

Security and governance

Prompt injection

Definition: Untrusted content attempting to override instructions, leak data, or induce unsafe tool behavior.

Untrusted content attempting to override instructions, leak data, or induce unsafe tool behavior. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Security and governance

Related: Confused deputy, Policy decision point, Policy enforcement point

Confused deputy

Definition: A failure where the model or runtime uses its authority on behalf of an untrusted input.

A failure where the model or runtime uses its authority on behalf of an untrusted input. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Security and governance

Related: Prompt injection, Policy decision point, Policy enforcement point

Policy decision point

Definition: The component that evaluates a request against policy and returns a decision such as allow, deny, or require approval.

The component that evaluates a request against policy and returns a decision such as allow, deny, or require approval. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Security and governance

Related: Prompt injection, Confused deputy, Policy enforcement point

Policy enforcement point

Definition: The component that intercepts an action, such as a tool call, queries the policy decision point, and allows or blocks the action accordingly.

The component that intercepts an action, such as a tool call, queries the policy decision point, and allows or blocks the action accordingly. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Security and governance

Related: Prompt injection, Confused deputy, Policy decision point
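
In common access-control usage, a policy decision point evaluates a request and a policy enforcement point acts on the result. The sketch below separates the two around a tool call. The toy rule (deny writes that originate from untrusted input) and the names `decide` and `enforce` are assumptions for illustration, not a recommended policy.

```python
from typing import Callable


def decide(tool_name: str, side_effect: str, input_trusted: bool) -> str:
    """Hypothetical policy decision point: returns "allow" or "deny"."""
    if side_effect == "read":
        return "allow"
    # Toy rule: actions with side effects require trusted input.
    return "allow" if input_trusted else "deny"


def enforce(
    tool: Callable[[], str], tool_name: str, side_effect: str, input_trusted: bool
) -> str:
    """Hypothetical policy enforcement point: runs the tool only if allowed."""
    decision = decide(tool_name, side_effect, input_trusted)
    if decision != "allow":
        raise PermissionError(f"{tool_name} blocked by policy ({decision})")
    return tool()


if __name__ == "__main__":
    print(enforce(lambda: "3 documents", "search_docs", "read", input_trusted=False))
    try:
        enforce(lambda: "sent", "send_email", "write", input_trusted=False)
    except PermissionError as err:
        print(err)
```

Keeping the decision logic out of the enforcement wrapper means the policy can change without touching every call site that invokes a tool.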

Redaction

Definition: Removing or masking sensitive data from prompts, outputs, logs, or traces before it is stored or shared.

Removing or masking sensitive data from prompts, outputs, logs, or traces before it is stored or shared. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Security and governance

Related: Prompt injection, Confused deputy, Policy decision point

Data classification

Definition: Labeling data by sensitivity so that the runtime can apply matching handling, retention, and access rules.

Labeling data by sensitivity so that the runtime can apply matching handling, retention, and access rules. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Security and governance

Related: Prompt injection, Confused deputy, Policy decision point

Model artifact integrity

Definition: Assurance that model weights and related files are unmodified and come from the expected source, typically verified with hashes or signatures.

Assurance that model weights and related files are unmodified and come from the expected source, typically verified with hashes or signatures. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Security and governance

Related: Prompt injection, Confused deputy, Policy decision point

Secrets handling

Definition: Storing, injecting, and rotating credentials so that they are available to tools but kept out of prompts, model context, and logs.

Storing, injecting, and rotating credentials so that they are available to tools but kept out of prompts, model context, and logs. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Security and governance

Related: Prompt injection, Confused deputy, Policy decision point

Sandboxing

Definition: Running untrusted code or tools in an isolated environment with restricted file, network, and process access.

Running untrusted code or tools in an isolated environment with restricted file, network, and process access. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Security and governance

Related: Prompt injection, Confused deputy, Policy decision point

Tenant isolation

Definition: Keeping each tenant's data, caches, and state inaccessible to other tenants that share the same runtime.

Keeping each tenant's data, caches, and state inaccessible to other tenants that share the same runtime. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Security and governance

Related: Prompt injection, Confused deputy, Policy decision point

Audit trail

Definition: A tamper-resistant, chronological record of who or what performed which action, when, and under which policy decision.

A tamper-resistant, chronological record of who or what performed which action, when, and under which policy decision. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Security and governance

Related: Prompt injection, Confused deputy, Policy decision point

Provenance

Definition: The recorded origin and history of a model, dataset, prompt, or output, including where it came from and how it was transformed.

The recorded origin and history of a model, dataset, prompt, or output, including where it came from and how it was transformed. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Security and governance

Related: Prompt injection, Confused deputy, Policy decision point

Policy checkpoint

Definition: A point in an execution flow where the runtime evaluates policy before continuing.

A point in an execution flow where the runtime evaluates policy before continuing. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Security and governance

Related: Prompt injection, Confused deputy, Policy decision point

Approval gate

Definition: A step that pauses execution until a human or another authority approves a pending action, typically one with side effects.

A step that pauses execution until a human or another authority approves a pending action, typically one with side effects. In the aRuntime.com taxonomy, this term should be interpreted by layer, workload, deployment boundary, and source context.

Where it fits: Security and governance

Related: Prompt injection, Confused deputy, Policy decision point

Sources and further reading

  1. ONNX Runtime Architecture — ONNX Runtime; official docs; accessed 2026-06-21 UTC.
  2. StableHLO Specification — OpenXLA; official specification; accessed 2026-06-21 UTC.
  3. vLLM Documentation — vLLM; official docs; accessed 2026-06-21 UTC.
  4. RadixAttention – SGLang — SGLang; official docs; accessed 2026-06-21 UTC.
  5. NVIDIA Triton Inference Server Architecture — NVIDIA; official docs; accessed 2026-06-21 UTC.
  6. Web Neural Network API — W3C; standard; accessed 2026-06-21 UTC.
  7. ExecuTorch Documentation — PyTorch; official docs; accessed 2026-06-21 UTC.
  8. LiteRT Documentation — Google AI Edge; official docs; accessed 2026-06-21 UTC.

Maintenance record

Found an error, outdated capability, or unclear category boundary? Submit a correction with a supporting source.