Kernels and Hardware Libraries

Definition, responsibilities, failure modes, and implementation guidance for kernels and hardware libraries.

Audience: Technical readers
Reading time: 2 minutes
Status: Production guidance
Last reviewed: 2026-06-23 UTC

Kernels and hardware libraries cover matrix multiplication, attention, convolution, normalization, quantization, vector operations, memory movement, collectives, and other hardware-tuned primitives.

Key takeaways

This layer maps mathematical operations to supported kernels.
Failures at this boundary are recognizable, for example an unsupported shape or dtype.
A product may span this layer and adjacent layers; classify responsibilities rather than brand language.

Definition and scope

In scope for this layer: matrix multiplication, attention, convolution, normalization, quantization, vector operations, memory movement, collectives, and hardware-tuned primitives.

Responsibilities

Map mathematical operations to supported kernels
Fuse operations to reduce memory traffic and launch overhead
Select algorithms by shape, precision, workspace, and determinism
Coordinate collective communication for parallel execution
Expose fallback behavior when optimized paths are unavailable
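
To make the fusion responsibility concrete, here is a minimal sketch in plain Python. It counts "launches" and element reads and writes for a scale, bias, and ReLU sequence run as three separate passes versus one fused pass. The TrafficCounter class and all function names are hypothetical stand-ins for a real profiler and real kernels; the counts illustrate the idea and are not measurements of any device.

```python
from dataclasses import dataclass


@dataclass
class TrafficCounter:
    """Hypothetical stand-in for a profiler: counts launches and element traffic."""
    launches: int = 0
    elements_read: int = 0
    elements_written: int = 0

    def record(self, n):
        # One launch that reads n elements and writes n elements.
        self.launches += 1
        self.elements_read += n
        self.elements_written += n


def scale(xs, factor, counter):
    counter.record(len(xs))
    return [x * factor for x in xs]


def add_bias(xs, bias, counter):
    counter.record(len(xs))
    return [x + bias for x in xs]


def relu(xs, counter):
    counter.record(len(xs))
    return [max(0.0, x) for x in xs]


def unfused(xs, factor, bias, counter):
    # Three launches; each intermediate list is written and then read again.
    return relu(add_bias(scale(xs, factor, counter), bias, counter), counter)


def fused(xs, factor, bias, counter):
    # One launch; intermediates never leave the loop body.
    counter.record(len(xs))
    return [max(0.0, x * factor + bias) for x in xs]


data = [-2.0, -0.5, 0.0, 1.5, 3.0]
unfused_counter, fused_counter = TrafficCounter(), TrafficCounter()
unfused_out = unfused(data, 2.0, 1.0, unfused_counter)
fused_out = fused(data, 2.0, 1.0, fused_counter)
```

In this toy model the fused path performs one launch and one read and write per element, while the unfused path performs three of each. On real hardware the fused kernel may also change rounding (for example through fused multiply-add), so fusion should be checked against the numerical tolerance as well as the traffic counts.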

Inputs, outputs, and boundaries

The layer consumes artifacts or requests from the layer above and relies on services from the layer below. Its contract should define supported inputs, produced outputs, lifecycle, compatibility, resource ownership, and failure semantics.

Failure modes

Unsupported shape or dtype
Workspace exhaustion
Numerical or determinism change
Kernel regression
Collective timeout
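
The "numerical or determinism change" failure mode can be illustrated with a small sketch. Floating-point addition is not associative, so two reduction orders over the same values can produce different bits. The function names below are hypothetical, and the tolerance is an assumption chosen for the example, not a recommended value.

```python
import math


def sum_left_to_right(values):
    """Sequential reduction, as a simple single-threaded kernel might do."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_pairwise(values):
    """Tree-shaped reduction, as a parallel kernel might do."""
    if not values:
        return 0.0
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return sum_pairwise(values[:mid]) + sum_pairwise(values[mid:])


def within_tolerance(a, b, rel_tol=1e-12):
    """Compare results against a stated tolerance instead of exact equality."""
    return math.isclose(a, b, rel_tol=rel_tol)


values = [0.1, 0.2, 0.3]
sequential = sum_left_to_right(values)   # (0.1 + 0.2) + 0.3
tree = sum_pairwise(values)              # 0.1 + (0.2 + 0.3)
bitwise_equal = sequential == tree
```

Here the two results differ in the last bit yet agree within the tolerance. A kernel or library update that changes reduction order can therefore break bitwise-reproducibility checks without being numerically wrong, which is why the acceptable tolerance and the determinism requirement need to be stated separately.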

Implementation guidance

Benchmark complete workload shapes rather than isolated peak FLOPS.
Record precision, algorithm, workspace, and determinism settings.
Provide a tested fallback path for unsupported operators.
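
The first two points above can be sketched as a small harness that times several workload shapes and stores the settings alongside each measurement. This is an illustration only: the pure-Python matmul stands in for a real kernel, and the function names, record fields, and settings values are hypothetical. The first call is labeled "cold" by convention here; in a real system it would include kernel loading, autotuning, or allocation costs.

```python
import statistics
import time


def matmul(a, b):
    """Reference pure-Python matrix multiply; a stand-in for a real kernel."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def make_matrix(rows, cols, value=1.0):
    return [[value] * cols for _ in range(rows)]


def benchmark_shapes(shapes, settings, repeats=5):
    """Time each (m, k, n) shape and attach the settings used to each record."""
    records = []
    for m, k, n in shapes:
        a, b = make_matrix(m, k), make_matrix(k, n)

        start = time.perf_counter()
        matmul(a, b)
        cold_s = time.perf_counter() - start

        warm = []
        for _ in range(repeats):
            start = time.perf_counter()
            matmul(a, b)
            warm.append(time.perf_counter() - start)

        records.append({
            "shape": (m, k, n),
            "cold_s": cold_s,
            "warm_median_s": statistics.median(warm),
            **settings,  # precision, algorithm, workspace, determinism
        })
    return records


# Hypothetical settings record; Python floats are 64-bit.
settings = {
    "precision": "fp64",
    "algorithm": "naive_reference",
    "workspace_bytes": 0,
    "deterministic": True,
}
records = benchmark_shapes([(2, 3, 4), (8, 8, 8), (1, 64, 1)], settings, repeats=3)
```

Keeping the settings in the same record as the timing means a later regression can be traced to a changed algorithm or precision rather than guessed at.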

Metrics

Measure the layer against objectives that match the workload. Avoid comparing unrelated categories, and do not publish performance numbers without the shapes, precision, and settings that produced them.

Dispatch, portability, and specialization

Kernel selection is a runtime decision constrained by operation shape, dtype, layout, alignment, device capability, available workspace, determinism mode, and numerical tolerance. A portable graph runtime may offer several execution providers, while a hardware-specific library can select among tuned implementations for one accelerator family. The portability boundary therefore belongs in a capability contract rather than in an assumption that every operator behaves identically everywhere.
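
One way to picture a capability contract is as data that each kernel declares and that the dispatcher checks against the request. The sketch below is a hypothetical model, not any library's API: OpRequest, KernelCapability, and select_kernel are invented for illustration, and shape constraints are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OpRequest:
    """What the caller needs (hypothetical fields)."""
    op: str
    dtype: str
    layout: str
    alignment_bytes: int
    available_workspace_bytes: int
    require_deterministic: bool


@dataclass(frozen=True)
class KernelCapability:
    """What a kernel declares it supports (hypothetical fields)."""
    name: str
    op: str
    dtypes: frozenset
    layouts: frozenset
    min_alignment_bytes: int
    workspace_bytes: int
    deterministic: bool

    def rejection_reason(self, req: OpRequest) -> Optional[str]:
        """Return None if the kernel can serve the request, else the reason."""
        if req.op != self.op:
            return f"op {req.op} not implemented"
        if req.dtype not in self.dtypes:
            return f"dtype {req.dtype} unsupported"
        if req.layout not in self.layouts:
            return f"layout {req.layout} unsupported"
        if req.alignment_bytes % self.min_alignment_bytes != 0:
            return f"alignment {req.alignment_bytes} not a multiple of {self.min_alignment_bytes}"
        if self.workspace_bytes > req.available_workspace_bytes:
            return "insufficient workspace"
        if req.require_deterministic and not self.deterministic:
            return "kernel is nondeterministic"
        return None


def select_kernel(req, candidates):
    """Pick the first acceptable kernel; candidates are ordered by preference."""
    rejected = {}
    for kernel in candidates:
        reason = kernel.rejection_reason(req)
        if reason is None:
            return kernel, rejected
        rejected[kernel.name] = reason
    return None, rejected


tuned = KernelCapability("tuned_matmul", "matmul", frozenset({"fp16", "bf16"}),
                         frozenset({"row_major"}), 16, 1 << 20, False)
reference = KernelCapability("reference_matmul", "matmul",
                             frozenset({"fp16", "bf16", "fp32"}),
                             frozenset({"row_major"}), 1, 0, True)

fast_req = OpRequest("matmul", "fp16", "row_major", 16, 2 << 20, False)
strict_req = OpRequest("matmul", "fp16", "row_major", 16, 2 << 20, True)
chosen_fast, _ = select_kernel(fast_req, [tuned, reference])
chosen_strict, why_not = select_kernel(strict_req, [tuned, reference])
```

Returning the rejection reasons with the selection keeps the decision inspectable: the same request can land on different kernels once determinism or workspace limits change, and the caller can see why.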

Fallback must be explicit. An unsupported operator may execute on a CPU, use a slower reference kernel, trigger graph partitioning, or fail deployment. Silent fallback can move data across device boundaries, increase latency, or change numerical behavior. Production systems should expose the selected kernel family, fallback reason, workspace use, and device transfer cost in profiling output.
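
The explicit-fallback rule can be sketched as a policy the deployer chooses plus a record that profiling output can display. Everything here is hypothetical: FallbackPolicy, DispatchRecord, and dispatch are illustrative names, and the transfer estimate assumes the output is the same size as the input.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FallbackPolicy(Enum):
    FAIL = "fail"                          # reject deployment
    REFERENCE_KERNEL = "reference_kernel"  # slower kernel, same device
    CPU = "cpu"                            # run on host, pay transfer cost


@dataclass(frozen=True)
class DispatchRecord:
    """Fields a profiler could surface for each dispatched operator."""
    op: str
    kernel_family: str
    device: str
    fallback_reason: Optional[str]
    workspace_bytes: int
    transfer_bytes: int


class UnsupportedOperatorError(RuntimeError):
    pass


def dispatch(op, tensor_bytes, optimized_ops, policy):
    """optimized_ops maps op name -> workspace bytes of its optimized kernel."""
    if op in optimized_ops:
        return DispatchRecord(op, "optimized", "accelerator", None,
                              optimized_ops[op], 0)

    reason = f"no optimized kernel registered for {op!r}"
    if policy is FallbackPolicy.REFERENCE_KERNEL:
        return DispatchRecord(op, "reference", "accelerator", reason, 0, 0)
    if policy is FallbackPolicy.CPU:
        # Assumption: input copied to host and an equally sized output copied back.
        return DispatchRecord(op, "reference", "cpu", reason, 0, 2 * tensor_bytes)
    raise UnsupportedOperatorError(reason)


optimized_ops = {"matmul": 1 << 20}
fast = dispatch("matmul", 4096, optimized_ops, FallbackPolicy.FAIL)
on_cpu = dispatch("custom_norm", 4096, optimized_ops, FallbackPolicy.CPU)
```

Because the fallback reason and transfer bytes travel with the record, a fallback is never silent: it either appears in profiling output or stops the deployment with an error.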

Operational review

Pin compatible driver, firmware, compiler, and library versions.
Test representative shapes, not only one benchmark shape.
Measure warm and cold dispatch, transfer, synchronization, and workspace allocation.
Define acceptable precision and deterministic-operation requirements.
Record fallback behavior for unsupported operations and new model versions.
Canary library updates against correctness and latency distributions before broad rollout.
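
The canary step in the checklist above can be sketched as a gate that compares a candidate library version with the baseline on both correctness and latency percentiles. The function names, tolerances, and the 10% slowdown budget are assumptions made for this example; real thresholds depend on the workload.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def outputs_match(baseline, candidate, rel_tol, abs_tol):
    """Element-wise comparison within the stated precision requirement."""
    return len(baseline) == len(candidate) and all(
        math.isclose(b, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for b, c in zip(baseline, candidate)
    )


def canary_report(baseline_out, candidate_out, baseline_lat, candidate_lat,
                  rel_tol=1e-5, abs_tol=1e-8, max_slowdown=1.10):
    """Check correctness plus median and tail latency; all must pass."""
    report = {
        "correct": outputs_match(baseline_out, candidate_out, rel_tol, abs_tol),
    }
    for pct in (50, 99):
        base = percentile(baseline_lat, pct)
        cand = percentile(candidate_lat, pct)
        report[f"p{pct}_ok"] = cand <= base * max_slowdown
    report["passed"] = all(report.values())
    return report


baseline_latency_ms = [1.0, 1.1, 1.2, 1.3, 5.0]
tail_regression_ms = [1.0, 1.1, 1.2, 1.3, 9.0]  # same median, worse tail

good = canary_report([1.0, 2.0], [1.0000001, 2.0],
                     baseline_latency_ms, baseline_latency_ms)
bad_tail = canary_report([1.0, 2.0], [1.0, 2.0],
                         baseline_latency_ms, tail_regression_ms)
```

Checking the tail percentile separately matters because a library update can leave the median unchanged while adding rare slow dispatches, which a mean or median alone would hide.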
