Kernels and Hardware Libraries cover matrix multiplication, attention, convolution, normalization, quantization, vector operations, memory movement, collectives, and hardware-tuned primitives.
Key takeaways
- Map mathematical operations to supported kernels.
- The boundary fails in recognizable ways, such as an unsupported shape or dtype.
- A product may span this layer and adjacent layers; classify it by the responsibilities it takes on rather than by brand language.
Definition and scope
This layer provides the low-level compute and communication routines that higher layers call: matrix multiplication, attention, convolution, normalization, quantization, vector operations, memory movement, collectives, and hardware-tuned primitives.
Responsibilities
- Map mathematical operations to supported kernels
- Fuse operations to reduce memory traffic and launch overhead
- Select algorithms by shape, precision, workspace, and determinism
- Coordinate collective communication for parallel execution
- Expose fallback behavior when optimized paths are unavailable
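To make the memory-traffic argument for fusion concrete, the sketch below counts bytes moved for a chain of elementwise operations, first unfused and then fused. It is a simplified model, not a measurement: it assumes every unfused operation reads its input from memory and writes its output back, and it ignores caches and parameter reads. The function names are hypothetical.

```python
def unfused_traffic_bytes(num_elements: int, num_ops: int, bytes_per_element: int) -> int:
    """Each op in the chain reads its full input and writes its full output."""
    return num_ops * 2 * num_elements * bytes_per_element


def fused_traffic_bytes(num_elements: int, bytes_per_element: int) -> int:
    """One fused kernel reads the input once and writes the final output once."""
    return 2 * num_elements * bytes_per_element


if __name__ == "__main__":
    # Three elementwise ops (for example scale, bias, activation) on 16-bit values.
    n, ops, width = 1_000_000, 3, 2
    print(unfused_traffic_bytes(n, ops, width))  # 12000000
    print(fused_traffic_bytes(n, width))         # 4000000
```

Under these assumptions, fusing the three operations cuts traffic by a factor of three and replaces three kernel launches with one.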
Inputs, outputs, and boundaries
The layer consumes artifacts or requests from the layer above and relies on services from the layer below. Its contract should define supported inputs, produced outputs, lifecycle, compatibility, resource ownership, and failure semantics.
Failure modes
- Unsupported shape or dtype
- Workspace exhaustion
- Numerical or determinism change
- Kernel regression
- Collective timeout
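A numerical or determinism change is the hardest of these failures to notice, because the kernel still returns a result. One way to detect it is to compare an optimized kernel against a reference within a stated tolerance and to check run-to-run bitwise repeatability. The sketch below uses only the Python standard library; the helper names and the default tolerances are illustrative assumptions, not recommended values.

```python
import math
from array import array
from typing import Callable, List, Sequence


def within_tolerance(reference: Sequence[float], candidate: Sequence[float],
                     rel_tol: float = 1e-3, abs_tol: float = 1e-5) -> bool:
    """True when every candidate element is close to the reference element."""
    if len(reference) != len(candidate):
        return False
    return all(math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
               for r, c in zip(reference, candidate))


def is_bitwise_repeatable(kernel: Callable[[Sequence[float]], List[float]],
                          inputs: Sequence[float], runs: int = 3) -> bool:
    """True when repeated runs produce byte-identical float64 output."""
    first = array("d", kernel(inputs)).tobytes()
    return all(array("d", kernel(inputs)).tobytes() == first
               for _ in range(runs - 1))


if __name__ == "__main__":
    values = [1e16, 1.0, -1e16]
    exact = [math.fsum(values)]   # correctly rounded sum
    naive = [sum(values)]         # left-to-right accumulation loses the 1.0
    print(within_tolerance(exact, naive))
    print(is_bitwise_repeatable(lambda xs: [sum(xs)], values))
```

The example input shows how a change in accumulation order alone can push a result outside tolerance, which is why the tolerance belongs in the layer's contract.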
Implementation guidance
- Benchmark complete workload shapes rather than isolated peak FLOPS.
- Record precision, algorithm, workspace, and determinism settings.
- Provide a tested fallback path for unsupported operators.
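The first two points can be combined in a small harness that times a kernel over every shape in the workload and stores the settings next to each result. The sketch below is a minimal illustration: `reference_matmul` is a pure-Python stand-in for a real kernel, and the settings keys are hypothetical names chosen for this example.

```python
import time
from statistics import median
from typing import Callable, Dict, List, Sequence, Tuple

Matrix = List[List[float]]


def reference_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Pure-Python matrix product, used here only as a stand-in kernel."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def benchmark_shapes(kernel: Callable[[Matrix, Matrix], Matrix],
                     shapes: Sequence[Tuple[int, int, int]],
                     settings: Dict[str, object],
                     repeats: int = 5) -> List[Dict[str, object]]:
    """Time the kernel on each (m, k, n) shape and attach the settings used."""
    records = []
    for m, k, n in shapes:
        a = [[1.0] * k for _ in range(m)]
        b = [[1.0] * n for _ in range(k)]
        kernel(a, b)  # warm-up run, excluded from the samples
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(a, b)
            samples.append(time.perf_counter() - start)
        records.append({"shape": (m, k, n), "median_s": median(samples), **settings})
    return records


if __name__ == "__main__":
    settings = {"precision": "float64", "algorithm": "reference_matmul",
                "workspace_bytes": 0, "deterministic": True}
    for record in benchmark_shapes(reference_matmul, [(8, 64, 8), (64, 64, 64)], settings):
        print(record)
```

Storing the settings in the same record as the timing keeps a later comparison from mixing results produced under different precision or determinism modes.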
Metrics
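Measure the layer against objectives that match the workload, such as latency distributions over representative shapes, dispatch and transfer overhead, and workspace use. Avoid comparing unrelated categories, and qualify every published performance number with the shape, precision, algorithm, and determinism settings that produced it.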
Dispatch, portability, and specialization
Kernel selection is a runtime decision constrained by operation shape, dtype, layout, alignment, device capability, available workspace, determinism mode, and numerical tolerance. A portable graph runtime may offer several execution providers, while a hardware-specific library can select among tuned implementations for one accelerator family. The portability boundary therefore belongs in a capability contract rather than in an assumption that every operator behaves identically everywhere.
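A capability contract of this kind can be sketched as data plus a selection function. The example below is a hypothetical illustration: the class names, fields, `speed_score`, and the two sample kernels are assumptions made for this sketch and do not describe any particular library.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Sequence, Tuple


@dataclass(frozen=True)
class KernelRequest:
    op: str
    dtype: str
    alignment_bytes: int
    workspace_limit_bytes: int
    require_deterministic: bool


@dataclass(frozen=True)
class KernelCapability:
    name: str
    op: str
    dtypes: FrozenSet[str]
    min_alignment_bytes: int
    workspace_bytes: int
    deterministic: bool
    speed_score: float  # hypothetical tuning score, higher is faster

    def rejection_reason(self, req: KernelRequest) -> Optional[str]:
        """Return why this kernel cannot serve the request, or None if it can."""
        if req.op != self.op:
            return f"op {req.op} not implemented"
        if req.dtype not in self.dtypes:
            return f"dtype {req.dtype} unsupported"
        if req.alignment_bytes < self.min_alignment_bytes:
            return f"needs {self.min_alignment_bytes}-byte alignment"
        if self.workspace_bytes > req.workspace_limit_bytes:
            return "workspace limit exceeded"
        if req.require_deterministic and not self.deterministic:
            return "kernel is nondeterministic"
        return None


def select_kernel(req: KernelRequest, candidates: Sequence[KernelCapability]
                  ) -> Tuple[Optional[KernelCapability], Dict[str, str]]:
    """Pick the fastest eligible kernel and report why the others were rejected."""
    rejected: Dict[str, str] = {}
    eligible = []
    for cap in candidates:
        reason = cap.rejection_reason(req)
        if reason is None:
            eligible.append(cap)
        else:
            rejected[cap.name] = reason
    chosen = max(eligible, key=lambda c: c.speed_score, default=None)
    return chosen, rejected


# Two illustrative kernels: a tuned but nondeterministic one and a generic one.
TUNED = KernelCapability("tuned_fp16_matmul", "matmul", frozenset({"float16"}),
                         16, 64 << 20, False, 10.0)
GENERIC = KernelCapability("generic_matmul", "matmul",
                           frozenset({"float16", "float32"}), 4, 0, True, 1.0)
```

Returning the rejection reasons alongside the choice lets the caller see why a faster kernel was passed over, for example when a determinism requirement rules it out.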
Fallback must be explicit. An unsupported operator may execute on a CPU, use a slower reference kernel, trigger graph partitioning, or fail deployment. Silent fallback can move data across device boundaries, increase latency, or change numerical behavior. Production systems should expose the selected kernel family, fallback reason, workspace use, and device transfer cost in profiling output.
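One way to keep fallback explicit is to make the dispatcher return a record of what actually ran and to fail loudly when fallback is disabled. The following is a minimal sketch under stated assumptions: `UnsupportedKernelError`, `DispatchRecord`, and the vector-width restriction in the sample kernel are hypothetical, and the transfer cost is a rough estimate of one round trip between device and host.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


class UnsupportedKernelError(Exception):
    """Raised by an optimized kernel that cannot handle the given input."""


@dataclass
class DispatchRecord:
    op: str
    kernel_family: str
    fallback_reason: Optional[str] = None
    transfer_bytes: int = 0


def run_with_fallback(op: str, data: Sequence[float],
                      optimized: Callable[[Sequence[float]], List[float]],
                      reference: Callable[[Sequence[float]], List[float]],
                      *, allow_fallback: bool,
                      bytes_per_element: int = 4
                      ) -> Tuple[List[float], DispatchRecord]:
    """Try the optimized path; fall back only when allowed, and record why."""
    try:
        return optimized(data), DispatchRecord(op, "optimized")
    except UnsupportedKernelError as exc:
        if not allow_fallback:
            raise RuntimeError(f"{op}: fallback disabled ({exc})") from exc
        # Assumed cost model: the data travels to the host and back once.
        transfer = 2 * len(data) * bytes_per_element
        return reference(data), DispatchRecord(op, "reference", str(exc), transfer)


def vectorized_relu(data: Sequence[float]) -> List[float]:
    """Sample optimized kernel with an illustrative length restriction."""
    if len(data) % 4 != 0:
        raise UnsupportedKernelError("length is not a multiple of 4")
    return [max(x, 0.0) for x in data]


def reference_relu(data: Sequence[float]) -> List[float]:
    return [max(x, 0.0) for x in data]


if __name__ == "__main__":
    out, record = run_with_fallback("relu", [-1.0, 2.0, 3.0], vectorized_relu,
                                    reference_relu, allow_fallback=True)
    print(out, record)
```

Setting `allow_fallback=False` turns the same condition into a deployment failure, which is often preferable to an unnoticed slow path.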
Operational review
- Pin compatible driver, firmware, compiler, and library versions.
- Test representative shapes, not only one benchmark shape.
- Measure warm and cold dispatch, transfer, synchronization, and workspace allocation.
- Define acceptable precision and deterministic-operation requirements.
- Record fallback behavior for unsupported operations and new model versions.
- Canary library updates against correctness and latency distributions before broad rollout.
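The canary step in the last item can be expressed as a gate over correctness and latency percentiles. The sketch below is one possible formulation: the nearest-rank percentile, the function names, and the 5% and 10% regression limits are assumptions chosen for illustration, not recommended thresholds.

```python
from typing import Sequence


def nearest_rank_percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def canary_passes(baseline_ms: Sequence[float], candidate_ms: Sequence[float],
                  correctness_mismatches: int, *,
                  max_p50_regression: float = 0.05,
                  max_p99_regression: float = 0.10) -> bool:
    """Reject the candidate on any mismatch or on a latency regression."""
    if correctness_mismatches > 0:
        return False
    for pct, limit in ((50, max_p50_regression), (99, max_p99_regression)):
        base = nearest_rank_percentile(baseline_ms, pct)
        cand = nearest_rank_percentile(candidate_ms, pct)
        if cand > base * (1 + limit):
            return False
    return True


if __name__ == "__main__":
    baseline = [1.0] * 99 + [2.0]
    slower_tail = [1.0] * 98 + [3.0, 3.0]
    print(canary_passes(baseline, baseline, 0))      # True
    print(canary_passes(baseline, slower_tail, 0))   # False
```

Checking the tail percentile separately from the median catches an update that keeps typical latency flat but slows down a minority of shapes.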
