ARuntime Reference

Kernels and Hardware Libraries

Definition, responsibilities, failure modes, and implementation guidance for kernels and hardware libraries.

Audience: Technical readers
Reading time: 2 minutes
Status: Production guidance
Last reviewed:

Kernels and hardware libraries cover matrix multiplication, attention, convolution, normalization, quantization, vector operations, memory movement, collectives, and hardware-tuned primitives.

Key takeaways

  • The layer's core job is to map mathematical operations to supported kernels.
  • The boundary fails in recognizable ways, such as an unsupported shape or dtype.
  • A product may span this layer and adjacent layers; classify responsibilities rather than brand language.

Definition and scope

This layer covers matrix multiplication, attention, convolution, normalization, quantization, vector operations, memory movement, collectives, and hardware-tuned primitives.

Responsibilities

  • Map mathematical operations to supported kernels
  • Fuse operations to reduce memory traffic and launch overhead
  • Select algorithms by shape, precision, workspace, and determinism
  • Coordinate collective communication for parallel execution
  • Expose fallback behavior when optimized paths are unavailable

Inputs, outputs, and boundaries

The layer consumes artifacts or requests from the layer above and relies on services from the layer below. Its contract should define supported inputs, produced outputs, lifecycle, compatibility, resource ownership, and failure semantics.

Failure modes

  • Unsupported shape or dtype
  • Workspace exhaustion
  • Numerical or determinism change
  • Kernel regression
  • Collective timeout

Implementation guidance

  • Benchmark complete workload shapes rather than isolated peak FLOPS.
  • Record precision, algorithm, workspace, and determinism settings.
  • Provide a tested fallback path for unsupported operators.
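The first two points above can be combined in a small harness that times every workload shape and stores the run settings next to each result. The following is a minimal sketch, not a production benchmark: `RunSettings`, `benchmark_shapes`, and `naive_matmul` are hypothetical names, and the pure-Python matmul is only a stand-in for a real kernel call.

```python
import statistics
import time
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunSettings:
    """Hypothetical record of the settings that must accompany a result."""
    precision: str
    algorithm: str
    workspace_bytes: int
    deterministic: bool


def benchmark_shapes(fn, shapes, settings, warmup=2, repeats=5):
    """Time fn(*shape) for every workload shape and attach the settings."""
    results = []
    for shape in shapes:
        for _ in range(warmup):  # warm-up runs are not timed
            fn(*shape)
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            fn(*shape)
            samples.append(time.perf_counter() - start)
        results.append({
            "shape": shape,
            "median_s": statistics.median(samples),
            "max_s": max(samples),
            **asdict(settings),
        })
    return results


def naive_matmul(m, n, k):
    """Stand-in workload: an (m x k) by (k x n) product in pure Python."""
    a = [[1.0] * k for _ in range(m)]
    b = [[1.0] * n for _ in range(k)]
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


settings = RunSettings("fp32", "naive", workspace_bytes=0, deterministic=True)
# Shapes should come from the real workload, not from a single peak case.
report = benchmark_shapes(naive_matmul, [(4, 4, 4), (8, 2, 16)], settings)
```

Each entry in `report` carries the shape, the timing summary, and the precision, algorithm, workspace, and determinism settings, so a later comparison cannot silently mix configurations.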

Metrics

Measure the layer with workload-appropriate objectives. Avoid comparing unrelated categories or publishing unqualified performance numbers.

Dispatch, portability, and specialization

Kernel selection is a runtime decision constrained by operation shape, dtype, layout, alignment, device capability, available workspace, determinism mode, and numerical tolerance. A portable graph runtime may offer several execution providers, while a hardware-specific library can select among tuned implementations for one accelerator family. The portability boundary therefore belongs in a capability contract rather than in an assumption that every operator behaves identically everywhere.
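One way to picture such a capability contract is a registry in which each kernel declares what it supports and the dispatcher filters on the request. This is an illustrative sketch only: `OpRequest`, `KernelCapability`, `select_kernel`, and the two registry entries are hypothetical, and shape and numerical tolerance are left out to keep it short.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class OpRequest:
    op: str
    dtype: str
    layout: str
    alignment: int        # byte alignment of the operands
    workspace_bytes: int  # scratch memory the caller can provide
    deterministic: bool   # caller requires reproducible results


@dataclass(frozen=True)
class KernelCapability:
    name: str
    op: str
    dtypes: frozenset
    layouts: frozenset
    min_alignment: int
    workspace_bytes: int  # scratch memory the kernel needs
    deterministic: bool
    priority: int         # higher means more specialized

    def rejection_reason(self, req: OpRequest) -> Optional[str]:
        """Return the first constraint the request violates, or None."""
        if req.op != self.op:
            return "operation"
        if req.dtype not in self.dtypes:
            return "dtype"
        if req.layout not in self.layouts:
            return "layout"
        if req.alignment % self.min_alignment != 0:
            return "alignment"
        if req.workspace_bytes < self.workspace_bytes:
            return "workspace"
        if req.deterministic and not self.deterministic:
            return "determinism"
        return None


def select_kernel(req: OpRequest,
                  registry: Iterable[KernelCapability]) -> Optional[KernelCapability]:
    """Pick the highest-priority kernel whose contract admits the request."""
    candidates = [k for k in registry if k.rejection_reason(req) is None]
    return max(candidates, key=lambda k: k.priority, default=None)


# Hypothetical registry: one tuned kernel and one portable reference kernel.
REGISTRY = [
    KernelCapability("matmul_tuned", "matmul", frozenset({"fp16"}),
                     frozenset({"row_major"}), 16, 1 << 20, False, 10),
    KernelCapability("matmul_reference", "matmul", frozenset({"fp16", "fp32"}),
                     frozenset({"row_major", "col_major"}), 1, 0, True, 1),
]

fast = OpRequest("matmul", "fp16", "row_major", 16, 1 << 20, False)
strict = OpRequest("matmul", "fp16", "row_major", 16, 1 << 20, True)
```

In this sketch `select_kernel(fast, REGISTRY)` returns the tuned kernel, while `select_kernel(strict, REGISTRY)` returns the reference kernel because the tuned one does not declare deterministic behavior. A return value of `None` means no kernel matched and a fallback decision is needed.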

Fallback must be explicit. An unsupported operator may execute on a CPU, use a slower reference kernel, trigger graph partitioning, or fail deployment. Silent fallback can move data across device boundaries, increase latency, or change numerical behavior. Production systems should expose the selected kernel family, fallback reason, workspace use, and device transfer cost in profiling output.
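As a rough illustration of explicit fallback, the sketch below makes the policy a required argument and returns a record with the fields named above. All names (`FallbackPolicy`, `DispatchRecord`, `dispatch`, `UnsupportedOperatorError`) are hypothetical, graph partitioning is omitted, and the transfer cost is modeled simply as one copy to the host and one back.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class FallbackPolicy(Enum):
    REFERENCE_KERNEL = "reference_kernel"  # slower kernel on the same device
    CPU = "cpu"                            # run on the host
    FAIL = "fail"                          # reject the deployment


class UnsupportedOperatorError(RuntimeError):
    """Raised when the policy forbids any fallback."""


@dataclass(frozen=True)
class DispatchRecord:
    op: str
    kernel_family: str
    fallback_reason: Optional[str]  # None when the optimized path ran
    workspace_bytes: int
    transfer_bytes: int             # bytes moved across the device boundary


def dispatch(op: str, tensor_bytes: int, optimized_ops: Dict[str, int],
             policy: FallbackPolicy) -> DispatchRecord:
    """Choose a path for op and report exactly what was chosen and why.

    optimized_ops maps an operator name to the workspace its optimized
    kernel needs (an assumption of this sketch).
    """
    if op in optimized_ops:
        return DispatchRecord(op, "optimized", None, optimized_ops[op], 0)

    reason = f"no optimized kernel for {op!r}"
    if policy is FallbackPolicy.FAIL:
        raise UnsupportedOperatorError(reason)
    if policy is FallbackPolicy.CPU:
        # Assumed cost model: inputs go to the host and results come back.
        return DispatchRecord(op, "cpu", reason, 0, 2 * tensor_bytes)
    return DispatchRecord(op, "reference", reason, 0, 0)


optimized = {"matmul": 1 << 20}
record = dispatch("custom_norm", 4096, optimized, FallbackPolicy.CPU)
```

Here `record.fallback_reason` and `record.transfer_bytes` are populated, so a profiler can surface the extra device round trip instead of leaving it hidden in a latency regression.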

Operational review

  • Pin compatible driver, firmware, compiler, and library versions.
  • Test representative shapes, not only one benchmark shape.
  • Measure warm and cold dispatch, transfer, synchronization, and workspace allocation.
  • Define acceptable precision and deterministic-operation requirements.
  • Record fallback behavior for unsupported operations and new model versions.
  • Canary library updates against correctness and latency distributions before broad rollout.
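The canary step in the last item could be gated by a check like the one below, which compares a candidate library against a baseline on both outputs and latency percentiles. It is a sketch under stated assumptions: the function names, the tolerances, the 5% regression budget, and the choice of p50 and p99 are illustrative defaults, not recommended values.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def canary_passes(baseline_out, candidate_out, baseline_lat, candidate_lat,
                  rel_tol=1e-3, abs_tol=1e-6, max_regression=0.05):
    """Return True only if outputs match within tolerance and neither
    p50 nor p99 latency regresses by more than max_regression."""
    if len(baseline_out) != len(candidate_out):
        return False
    for expected, actual in zip(baseline_out, candidate_out):
        if not math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=abs_tol):
            return False
    for q in (50, 99):
        budget = percentile(baseline_lat, q) * (1 + max_regression)
        if percentile(candidate_lat, q) > budget:
            return False
    return True


# Illustrative data: latencies in milliseconds, outputs as flat floats.
baseline_lat = [1.0] * 98 + [2.0, 2.0]
slow_tail_lat = [1.0] * 98 + [3.0, 3.0]
outputs = [0.5, 1.25, -3.0]

same_library_ok = canary_passes(outputs, outputs, baseline_lat, baseline_lat)
tail_regression_ok = canary_passes(outputs, outputs, baseline_lat, slow_tail_lat)
```

With this data the first check passes and the second fails: the median is unchanged, but the p99 moves from 2.0 ms to 3.0 ms, which is why the comparison looks at the distribution rather than a single mean.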
