Hardware Targets

Understand CPU, GPU, TPU, NPU, FPGA, browser, mobile, and heterogeneous AI hardware targets through memory bandwidth, precision, interconnect, power, and runtime constraints.

Audience: Technical readers · Reading time: 5 minutes · Status: Hardware

Key takeaways

  • Peak arithmetic throughput is only one limit; memory bandwidth, capacity, on-chip storage, interconnect, power, and thermals often dominate.
  • CPUs offer the broadest reach and the strongest control-flow execution, GPUs offer parallel throughput, and NPUs and custom accelerators offer energy efficiency for the operations they support.
  • Heterogeneous execution introduces partition boundaries and transfer costs that must be measured.
  • Precision support is useful only when the model, runtime, and kernels implement it end to end.
  • Firmware, drivers, compiler, runtime, libraries, and the model artifact form one compatibility chain that must be tested together.
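
The first takeaway can be made concrete with a roofline-style bound. The sketch below is illustrative only: the device figures are hypothetical, and the model ignores capacity, interconnect, power, and thermals, which the takeaways also name as limits.

```python
def attainable_flops(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Roofline bound: min(peak compute, bandwidth * FLOPs per byte moved)."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


def ridge_point(peak_flops, mem_bandwidth):
    """Arithmetic intensity (FLOPs/byte) above which compute becomes the limit."""
    return peak_flops / mem_bandwidth


# Hypothetical device: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK = 100e12
BW = 1e12

for name, intensity in [("low-intensity op", 2.0), ("high-intensity op", 400.0)]:
    bound = attainable_flops(PEAK, BW, intensity)
    limit = "memory-bound" if intensity < ridge_point(PEAK, BW) else "compute-bound"
    print(f"{name}: {bound / 1e12:.0f} TFLOP/s attainable ({limit})")
```

With these assumed numbers, an operation doing 2 FLOPs per byte can reach at most 2% of peak, which is why peak arithmetic throughput alone is a poor selection criterion.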

Runtime boundary

A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.

Receives

Compiled kernels or graphs, tensor buffers, precision requirements, placement decisions, and device capability information.

Owns

Physical compute, memory hierarchy, device allocation, supported instructions and precision, and communication limits.

Emits

Executed operations, device events, memory transfers, utilization counters, and hardware-specific errors.

Does not own

Model semantics, request authorization, benchmark comparability, or application workflow.

Failure modes

Out-of-memory, unsupported operators, driver incompatibility, transfer bottlenecks, thermal throttling, topology mismatch, and underutilization.

Evidence and metrics

Device memory, memory bandwidth, transfer time, utilization, occupancy, power, thermals, queue depth, and error counters.

CPU targets

CPUs provide broad availability, large system memory, mature debugging, and strong scalar/control-flow execution.

Implementation

Use vectorized libraries, thread pools, NUMA-aware placement, pinned (page-locked) memory for buffers that are handed to an accelerator, and model formats that support memory mapping.
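
As a minimal sketch of the memory-mapping point, the example below reads a slice of a weight file through the standard-library `mmap` module instead of loading the whole file. The file layout (a flat array of little-endian float32 values) and the `read_slice` helper are assumptions for illustration, not a real model format.

```python
import mmap
import os
import struct
import tempfile

# Write a small stand-in "weight file": 1024 little-endian float32 values.
values = [float(i) for i in range(1024)]
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<1024f", *values))


def read_slice(path, start, count):
    """Read `count` float32 values starting at element `start` via mmap."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            offset = start * 4  # 4 bytes per float32
            return list(struct.unpack_from(f"<{count}f", mapped, offset))


print(read_slice(path, 512, 4))  # [512.0, 513.0, 514.0, 515.0]
```

The mapping lets the operating system page data in on demand, so how much of the file ends up resident depends on the access pattern and the platform.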

Operational implications

CPUs are effective for small models, preprocessing, sparse/control-heavy tasks, local inference, and fallback. Cross-socket memory can become the bottleneck.

Measure

Socket/core utilization, memory bandwidth, remote NUMA memory accesses, vector instruction use, latency, and power.

GPU targets

GPUs provide parallel matrix, attention, and media compute with high-bandwidth device memory.

Implementation

Keep weights and active state resident, batch compatible work, use fused kernels, and minimize host-device transfers.

Operational implications

GPU capacity is constrained by HBM/VRAM, power, cooling, and availability. Decode is often memory-bandwidth-bound, so low compute utilization does not by itself indicate spare capacity.
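
A rough way to see the bandwidth limit on decode is to compute a floor on step time from bytes moved. This sketch assumes each decode step reads the full weights and KV cache once from device memory and ignores compute, kernel launch, and overlap. The sizes and bandwidth are hypothetical.

```python
def decode_step_floor_ms(weight_bytes, kv_cache_bytes, bandwidth_bytes_per_s):
    """Lower bound on one decode step if weights and KV cache are each read once."""
    return (weight_bytes + kv_cache_bytes) / bandwidth_bytes_per_s * 1e3


# Hypothetical: 14 GB of weights, 2 GB of KV cache, 1 TB/s device bandwidth.
floor_ms = decode_step_floor_ms(14e9, 2e9, 1e12)
print(f"at least {floor_ms:.1f} ms per decode step")
```

If measured step time sits near this floor, adding arithmetic throughput will not help; reducing bytes moved (for example through batching or lower precision) is the lever to evaluate.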

Measure

HBM used/free, bandwidth, SM utilization, kernel occupancy, transfer time, power, and throttling.

TPUs and matrix accelerators

Matrix-oriented accelerators can deliver high efficiency when compiler and model operations map well to their execution model.

Implementation

Use the supported compiler/runtime stack, static or bounded shapes, approved precision, and topology-aware sharding.
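
One common way to obtain bounded shapes is to pad variable-length inputs up to a small set of pre-compiled sizes. The sketch below assumes a hypothetical bucket list and padding token; real bucket choices depend on the workload and the compiler.

```python
import bisect

# Hypothetical set of sequence lengths that have been compiled ahead of time.
BUCKETS = [128, 256, 512, 1024]


def pick_bucket(length, buckets=BUCKETS):
    """Return the smallest bucket that can hold `length` elements."""
    i = bisect.bisect_left(buckets, length)
    if i == len(buckets):
        raise ValueError(f"length {length} exceeds largest bucket {buckets[-1]}")
    return buckets[i]


def pad_to_bucket(tokens, pad_id=0):
    """Pad a token list so its length matches one of the compiled shapes."""
    target = pick_bucket(len(tokens))
    return tokens + [pad_id] * (target - len(tokens))


padded = pad_to_bucket(list(range(130)))
print(len(padded))  # 256
```

The trade-off is wasted compute on padding versus the number of distinct shapes that must be compiled and cached.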

Operational implications

Portability is lower and compilation/runtime boundaries are central. Evaluate full workload support, not isolated matrix throughput.

Measure

Compile time, device utilization, host/device input time, collective time, and goodput.

NPUs and DSPs

Mobile and edge NPUs optimize low-power inference for supported operator sets and data types.

Implementation

Use delegates or execution providers, quantized artifacts, capability discovery, and explicit CPU/GPU fallback policy.
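
An explicit fallback policy can be as simple as a function that returns both the chosen backend and the reason. The sketch below selects one backend for a whole model; the backend names, op names, and capability sets are hypothetical, and real delegates typically partition per operator rather than per model.

```python
def choose_backend(required_ops, capabilities, preference, fallback=None):
    """Pick the first preferred backend that supports every required op.

    capabilities: dict mapping backend name -> set of supported op names
                  (in practice this would come from capability discovery).
    Returns (backend, reason) so the decision can be logged.
    """
    gaps = {}
    for backend in preference:
        missing = required_ops - capabilities.get(backend, set())
        if not missing:
            return backend, "all required ops supported"
        gaps[backend] = sorted(missing)
    if fallback is None:
        raise RuntimeError(f"no backend covers the model: {gaps}")
    return fallback, f"fallback, unsupported ops: {gaps}"


# Hypothetical capability sets and op names.
capabilities = {
    "npu": {"conv2d", "relu", "add"},
    "gpu": {"conv2d", "relu", "add", "softmax"},
}
print(choose_backend({"conv2d", "softmax"}, capabilities, ["npu", "gpu"], fallback="cpu"))
print(choose_backend({"conv2d", "topk"}, capabilities, ["npu", "gpu"], fallback="cpu"))
```

Raising when no fallback is configured makes a silent CPU fallback impossible, which keeps the policy visible in logs and tests.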

Operational implications

Operator coverage and layout conversions can fragment graphs. Sustained thermal behavior matters more than short bursts.

Measure

Delegate coverage, fallback operations, transfer time, sustained latency, energy, and thermals.
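
To separate sustained latency from burst latency, one option is to compare the start and end of a long run. This is a minimal sketch with made-up samples; the window size and the use of the median are assumptions, and a real test would also log temperature and clock frequency.

```python
from statistics import median


def sustained_slowdown(latencies_ms, window):
    """Ratio of median latency in the last `window` samples to the first `window`."""
    if len(latencies_ms) < 2 * window:
        raise ValueError("need at least two non-overlapping windows of samples")
    burst = median(latencies_ms[:window])
    sustained = median(latencies_ms[-window:])
    return sustained / burst


# Made-up run: latency drifts upward as the device heats up.
samples = [10.0] * 5 + [12.0] * 5 + [15.0] * 5
print(sustained_slowdown(samples, window=5))  # 1.5
```

A ratio well above 1.0 suggests that a short benchmark overstates what the device delivers under continuous load.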

FPGAs and custom accelerators

FPGAs can implement deterministic pipelines or custom dataflows, while fixed ASICs optimize narrower operation families.

Implementation

Use AOT compilation, static interfaces, bounded shapes, and vendor toolchain provenance.

Operational implications

They suit stable, latency- or energy-critical workloads where the development and portability costs are justified.

Measure

End-to-end latency, initiation interval, resource utilization, power, compilation time, and supported model coverage.

Heterogeneous partitioning

One model may be split across CPU, GPU, NPU, and specialized libraries.

Implementation

Record the final partition map, tensor layouts, copies, synchronization points, and the reason for each fallback.
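
A partition record might look like the sketch below. It assumes a linear chain of nodes (real graphs branch) and uses a hypothetical `Node` type with made-up names and sizes; the point is only to show which facts are worth capturing.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Node:
    name: str
    device: str               # where the node was finally placed
    output_bytes: int         # size of the tensor handed to the next node
    fallback_reason: str = "" # empty when the node ran on its intended device


def partition_map(nodes):
    """Group consecutive nodes on the same device into partitions."""
    return [(dev, [n.name for n in grp]) for dev, grp in groupby(nodes, key=lambda n: n.device)]


def boundary_bytes(nodes):
    """Bytes that cross a device boundary between consecutive nodes."""
    return sum(a.output_bytes for a, b in zip(nodes, nodes[1:]) if a.device != b.device)


def fallback_share(nodes):
    """Fraction of nodes that recorded a fallback reason."""
    return sum(1 for n in nodes if n.fallback_reason) / len(nodes)


nodes = [
    Node("conv", "npu", 4096),
    Node("relu", "npu", 4096),
    Node("topk", "cpu", 256, "op not supported by NPU"),
    Node("dense", "npu", 64),
]
print(partition_map(nodes))
print(boundary_bytes(nodes), fallback_share(nodes))  # 4352 0.25
```

Here a single unsupported node turns one partition into three and adds two boundary copies, which is the kind of detail that only the final, post-placement map reveals.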

Operational implications

A nominally accelerated graph can be slower than CPU-only execution if unsupported nodes cause repeated transfers.
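
A back-of-the-envelope cost model shows how this can happen. All figures below are hypothetical, and the model assumes transfers do not overlap with compute; measured values should replace every input.

```python
def hybrid_time_ms(accel_ms, cpu_fallback_ms, crossings,
                   bytes_per_crossing, link_bytes_per_s, sync_ms_per_crossing):
    """Total time when part of the graph falls back to CPU and tensors cross devices."""
    copy_ms = bytes_per_crossing / link_bytes_per_s * 1e3
    transfer_ms = crossings * (copy_ms + sync_ms_per_crossing)
    return accel_ms + cpu_fallback_ms + transfer_ms


cpu_only_ms = 20.0  # hypothetical CPU-only latency for the whole graph

hybrid_ms = hybrid_time_ms(
    accel_ms=5.0,              # supported nodes on the accelerator
    cpu_fallback_ms=3.0,       # unsupported nodes on the CPU
    crossings=40,              # device boundary crossings per inference
    bytes_per_crossing=1e6,    # 1 MB tensor per crossing
    link_bytes_per_s=10e9,     # 10 GB/s host-device link
    sync_ms_per_crossing=0.3,  # fixed synchronization cost
)
print(f"hybrid {hybrid_ms:.1f} ms vs CPU-only {cpu_only_ms:.1f} ms")
```

With these assumed inputs the 40 crossings cost about 16 ms, so the hybrid path takes about 24 ms and loses to the 20 ms CPU-only path despite much faster accelerator compute.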

Measure

Partition count, bytes transferred, boundary latency, provider placement, and fallback share.

Fleet compatibility

Hardware capability depends on firmware, driver, compiler, runtime, and artifact versions.

Implementation

Maintain a certified compatibility tuple and rollout rings; attach device identity and software versions to traces and benchmarks.
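
A compatibility tuple can be represented as a plain named tuple and checked against a certified set. The field names and version strings below are hypothetical placeholders, not real releases.

```python
from typing import NamedTuple


class StackVersion(NamedTuple):
    firmware: str
    driver: str
    compiler: str
    runtime: str
    artifact: str


# Hypothetical certified combinations.
CERTIFIED = {
    StackVersion("fw-1.2", "drv-550", "cc-3.1", "rt-2.0", "model-a"),
}


def skewed_fields(stack, certified=CERTIFIED):
    """Return [] if `stack` is certified, else the fields that differ
    from the closest certified tuple."""
    if stack in certified:
        return []
    closest = min(certified, key=lambda c: sum(a != b for a, b in zip(c, stack)))
    return [f for f, a, b in zip(stack._fields, stack, closest) if a != b]


observed = StackVersion("fw-1.2", "drv-555", "cc-3.1", "rt-2.0", "model-a")
print(skewed_fields(observed))  # ['driver']
```

Attaching the same tuple to traces and benchmark results makes it possible to tell a regression caused by a driver change from one caused by the model or runtime.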

Operational implications

Independent upgrades can change numerics, available kernels, memory use, or ABI behavior.

Measure

Compatibility-test pass rate, driver/runtime skew, model-load failures, and rollback time.

Reference tables

Hardware target comparison
| Target | Typical strength | Primary constraint | Runtime implication |
|---|---|---|---|
| CPU | Portability, control flow, large system memory | Lower dense tensor throughput | Vectorized kernels, thread pools, NUMA awareness |
| GPU | Parallel matrix and attention workloads | HBM capacity, bandwidth, power | Batching, fusion, device-local caches |
| TPU / matrix ASIC | Efficient tensor programs | Compiler/platform coupling | Compiler-centered deployment and sharding |
| NPU / DSP | Low-power on-device inference | Restricted operator and precision set | Delegation, quantization, CPU fallback |
| FPGA | Deterministic custom pipelines | Toolchain and design complexity | AOT artifacts and fixed interfaces |
| Browser accelerator | Client-local compute | API and sandbox limits | Progressive enhancement and fallback |

Decision checklist

  1. Is the workload compute-, memory-, transfer-, latency-, power-, or capacity-bound?
  2. What weight, activation, and KV-cache capacity must remain resident?
  3. Which precision modes are supported end to end and quality-tested?
  4. What fallback path exists for unsupported operations?
  5. Which topology and interconnect assumptions are required?
  6. Can the fleet maintain a tested driver/runtime/artifact matrix?
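
For the residency question in the checklist, a first-pass estimate can be computed before any hardware is provisioned. The sketch assumes a transformer-style KV cache that stores one key and one value vector per layer, head, and position. The model dimensions, sizes, and headroom fraction are hypothetical.

```python
GIB = 1024 ** 3


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    """Key and value tensors (hence the factor 2) for every layer and position."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


def fits(weight_bytes, kv_bytes, activation_bytes, device_bytes, headroom=0.1):
    """True if the resident set fits while leaving `headroom` of the device free."""
    return weight_bytes + kv_bytes + activation_bytes <= device_bytes * (1 - headroom)


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 16-bit cache.
kv = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                    seq_len=4096, batch=8, bytes_per_elem=2)
print(kv / GIB)  # 4.0

# Hypothetical 24 GiB device with 14 GiB of weights and 1 GiB of activations.
print(fits(14 * GIB, kv, 1 * GIB, 24 * GIB))      # True
print(fits(14 * GIB, 2 * kv, 1 * GIB, 24 * GIB))  # False (batch doubled)
```

The estimate only bounds capacity; allocator fragmentation and runtime workspace are not modeled, so the result should be confirmed against measured device memory.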

Common mistakes

  • Selecting hardware by peak FLOPS without modeling memory and communication.
  • Assuming delegate or execution-provider handoff is free.
  • Publishing a precision mode without reporting quality and kernel support.
  • Ignoring sustained thermal throttling on edge hardware.
  • Changing drivers independently of compiled artifacts and runtime validation.

Sources and further reading


  1. CUDA C++ Programming Guide
    NVIDIA · Official documentation · accessed 2026-06-21 UTC

  2. Execution Providers
    ONNX Runtime · Official documentation · accessed 2026-06-21 UTC

  3. Web Neural Network API
    W3C · Standard · accessed 2026-06-21 UTC

  4. Core ML
    Apple · Official documentation · accessed 2026-06-21 UTC

  5. ExecuTorch backend delegation
    PyTorch · Official documentation · accessed 2026-06-21 UTC

Last reviewed: 2026-06-21 UTC
