Hardware Targets - aRuntime.com

Key takeaways

Peak arithmetic throughput is only one limit; memory bandwidth, capacity, on-chip storage, interconnect, power, and thermals often dominate.
CPUs maximize reach and control-flow performance; GPUs maximize parallel throughput; NPUs and custom accelerators maximize energy efficiency for the operations they support.
Heterogeneous execution introduces partition boundaries and transfer costs that must be measured.
Precision support is useful only when the model, runtime, and kernels implement it end to end.
Firmware, drivers, compiler, runtime, libraries, and the model artifact form one compatibility chain that must be tested together.

Runtime boundary

A useful architecture identifies what the hardware layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.

Receives

Compiled kernels or graphs, tensor buffers, precision requirements, placement decisions, and device capability information.

Owns

Physical compute, memory hierarchy, device allocation, supported instructions and precision, and communication limits.

Emits

Executed operations, device events, memory transfers, utilization counters, and hardware-specific errors.

Does not own

Model semantics, request authorization, benchmark comparability, or application workflow.

Failure modes

Out-of-memory, unsupported operators, driver incompatibility, transfer bottlenecks, thermal throttling, topology mismatch, and underutilization.

Evidence and metrics

Device memory, memory bandwidth, transfer time, utilization, occupancy, power, thermals, queue depth, and error counters.

CPU targets

CPUs provide broad availability, large system memory, mature debugging, and strong scalar/control-flow execution.

Implementation

Use vectorized libraries, thread pools, NUMA-aware placement, pinned memory where appropriate, and model formats that support memory mapping.
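
The memory-mapping point can be illustrated with a minimal sketch. It assumes a hypothetical flat file of little-endian float32 weights, not any real model format; `write_weights`, `read_weight`, and `demo` are illustrative names. Only the Python standard library is used.

```python
import mmap
import os
import struct
import tempfile

def write_weights(path, values):
    """Write float32 values as a flat little-endian blob (hypothetical format)."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}f", *values))

def read_weight(path, index):
    """Read one float32 through a read-only memory map.

    The file is not copied into the process up front; the operating
    system pages data in when the mapped bytes are accessed.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            (value,) = struct.unpack_from("<f", mm, index * 4)
            return value

def demo():
    # Values chosen to be exactly representable in float32.
    path = os.path.join(tempfile.mkdtemp(), "weights.bin")
    write_weights(path, [0.5, 1.5, 2.5, 3.5])
    return read_weight(path, 2)
```

A read-only mapping can also be shared between processes on the same host, which is one reason mappable formats suit CPU serving; whether that holds for a given format and loader should be verified.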

Operational implications

CPUs are effective for small models, preprocessing, sparse/control-heavy tasks, local inference, and fallback. Cross-socket memory can become the bottleneck.

Measure

Socket/core utilization, memory bandwidth, remote (cross-node) NUMA memory accesses, vector instruction use, latency, and power.

GPU targets

GPUs provide parallel matrix, attention, and media compute with high-bandwidth device memory.

Implementation

Keep weights and active state resident, batch compatible work, use fused kernels, and minimize host-device transfers.

Operational implications

GPU capacity is constrained by HBM/VRAM, power, cooling, and availability. Decode is often memory-bandwidth-bound, so low compute utilization does not by itself indicate spare capacity.
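
A roofline-style estimate makes the bandwidth-bound case concrete. The sketch below is a rough model under stated assumptions: about 2 FLOPs per parameter per generated token for a dense model, each weight read once per step, and KV-cache traffic ignored. The device figures in the usage note are hypothetical, not a specific product.

```python
def roofline(flops, bytes_moved, peak_flops, peak_bytes_per_s):
    """Return (estimated seconds, limiting resource, compute utilization)."""
    compute_s = flops / peak_flops
    memory_s = bytes_moved / peak_bytes_per_s
    time_s = max(compute_s, memory_s)
    bound = "compute" if compute_s >= memory_s else "memory"
    return time_s, bound, compute_s / time_s

def decode_step(n_params, bytes_per_param, batch, peak_flops, peak_bytes_per_s):
    """Estimate one decode step for a dense model (assumptions in the lead-in)."""
    flops = 2 * n_params * batch              # assumed ~2 FLOPs per parameter per token
    bytes_moved = n_params * bytes_per_param  # weights read once, shared by the batch
    return roofline(flops, bytes_moved, peak_flops, peak_bytes_per_s)
```

For example, `decode_step(10**9, 2, 1, 100e12, 1e12)` reports a memory-bound step with roughly 1% compute utilization on the assumed device, while a batch of 200 on the same device becomes compute-bound. This is why batching compatible work raises utilization without raising weight traffic.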

Measure

HBM used/free, bandwidth, SM utilization, kernel occupancy, transfer time, power, and throttling.

TPUs and matrix accelerators

Matrix-oriented accelerators can deliver high efficiency when compiler and model operations map well to their execution model.

Implementation

Use the supported compiler/runtime stack, static or bounded shapes, approved precision, and topology-aware sharding.
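
One common way to keep shapes bounded is to pad variable-length inputs up to a small set of bucket sizes, so the compiler sees only a few distinct shapes. The sketch below assumes hypothetical bucket sizes and a hypothetical padding id of 0; real values depend on the model and compiler.

```python
import bisect

BUCKETS = (128, 256, 512, 1024)  # hypothetical, must be sorted ascending

def pick_bucket(length, buckets=BUCKETS):
    """Return the smallest bucket that can hold `length` tokens."""
    i = bisect.bisect_left(buckets, length)
    if i == len(buckets):
        raise ValueError(f"length {length} exceeds largest bucket {buckets[-1]}")
    return buckets[i]

def pad_to_bucket(tokens, pad_id=0, buckets=BUCKETS):
    """Pad a token list to its bucket size so only len(buckets) shapes exist."""
    target = pick_bucket(len(tokens), buckets)
    return list(tokens) + [pad_id] * (target - len(tokens))
```

The trade-off is wasted compute on padding versus recompilation for every new shape; the bucket spacing is a tuning decision, not a fixed rule.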

Operational implications

Portability is lower and compilation/runtime boundaries are central. Evaluate full workload support, not isolated matrix throughput.

Measure

Compile time, device utilization, host/device input time, collective time, and goodput.

NPUs and DSPs

Mobile and edge NPUs optimize low-power inference for supported operator sets and data types.

Implementation

Use delegates or execution providers, quantized artifacts, capability discovery, and explicit CPU/GPU fallback policy.
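
Capability discovery plus an explicit fallback policy can be sketched as a pure function. `select_providers` is a hypothetical helper, not part of any runtime. The provider names follow ONNX Runtime's naming; in that runtime the `available` list could come from `onnxruntime.get_available_providers()` and the result could be passed as `providers=` when creating an `InferenceSession`, but nothing below imports it.

```python
CPU = "CPUExecutionProvider"

def select_providers(available, preferred, allow_cpu_fallback=True):
    """Return an ordered provider list, failing loudly if policy cannot be met.

    available: providers the installed runtime reports on this device.
    preferred: accelerators in priority order (policy, not discovery).
    """
    chosen = [p for p in preferred if p in available and p != CPU]
    if allow_cpu_fallback and CPU in available:
        chosen.append(CPU)  # CPU goes last so it only catches unsupported work
    if not chosen:
        raise RuntimeError(
            f"no usable provider: preferred={preferred}, available={available}"
        )
    return chosen
```

Making `allow_cpu_fallback` an explicit argument means a deployment that cannot tolerate silent CPU execution fails at load time rather than degrading at run time.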

Operational implications

Operator coverage and layout conversions can fragment graphs. Sustained thermal behavior matters more than short bursts.

Measure

Delegate coverage, fallback operations, transfer time, sustained latency, energy, and thermals.
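
Sustained latency can be separated from burst latency by comparing the start and end of a long run. This is a minimal sketch; the window size and the choice of the median are assumptions, and `sustained_vs_burst` is an illustrative name.

```python
import statistics

def sustained_vs_burst(latencies_ms, window=50):
    """Compare median latency of the first and last `window` samples.

    A ratio well above 1.0 suggests throttling or another sustained-load
    effect that a short benchmark would miss.
    """
    if len(latencies_ms) < 2 * window:
        raise ValueError("need at least two non-overlapping windows")
    burst = statistics.median(latencies_ms[:window])
    sustained = statistics.median(latencies_ms[-window:])
    return burst, sustained, sustained / burst
```

The run has to be long enough for the device to reach thermal steady state; how long that takes is device-specific and should be measured rather than assumed.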

FPGAs and custom accelerators

FPGAs can implement deterministic pipelines or custom dataflows, while fixed-function ASICs optimize narrower operation families.

Implementation

Use AOT compilation, static interfaces, bounded shapes, and vendor toolchain provenance.

Operational implications

They suit stable, latency- or energy-critical workloads where the development and portability costs are justified.

Measure

End-to-end latency, initiation interval, resource utilization, power, compilation time, and supported model coverage.
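
Initiation interval (II) relates to end-to-end latency through the standard pipeline relation: the first item takes the full pipeline depth, and each later item completes II cycles after the previous one. The sketch below assumes a single fully pipelined stage chain with no stalls; the clock rate in the usage note is hypothetical.

```python
def pipeline_cycles(n_items, depth_cycles, initiation_interval):
    """Total cycles to process n_items through an ideal, stall-free pipeline."""
    if n_items <= 0:
        return 0
    return depth_cycles + (n_items - 1) * initiation_interval

def pipeline_seconds(n_items, depth_cycles, initiation_interval, clock_hz):
    """Convert the cycle count to wall time at the given clock rate."""
    return pipeline_cycles(n_items, depth_cycles, initiation_interval) / clock_hz
```

With a depth of 20 cycles, 1000 items take 1019 cycles at II = 1 and 2018 cycles at II = 2, so for long streams throughput is set almost entirely by II rather than by pipeline depth. At an assumed 200 MHz clock, the II = 1 case is about 5.1 microseconds.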

Heterogeneous partitioning

One model may be split across CPU, GPU, NPU, and specialized libraries.

Implementation

Record the final partition map, tensor layouts, copies, synchronization, and fallback reason.

Operational implications

A nominally accelerated graph can be slower than CPU-only execution if unsupported nodes cause repeated transfers.
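
A back-of-the-envelope model shows how transfers can erase the gain. The sketch assumes every boundary crossing moves the same number of bytes over one link and pays a fixed synchronization overhead; all numbers in the usage note are hypothetical.

```python
def hybrid_time_ms(accel_compute_ms, cpu_fallback_ms, crossings,
                   bytes_per_crossing, link_gb_per_s, overhead_ms_per_crossing):
    """Estimate latency of a graph split between an accelerator and the CPU."""
    copy_ms = bytes_per_crossing / (link_gb_per_s * 1e9) * 1e3
    transfer_ms = crossings * (copy_ms + overhead_ms_per_crossing)
    return accel_compute_ms + cpu_fallback_ms + transfer_ms

def is_worth_accelerating(cpu_only_ms, **hybrid_kwargs):
    """True only if the split graph beats CPU-only execution in this model."""
    return hybrid_time_ms(**hybrid_kwargs) < cpu_only_ms
```

For instance, with 10 ms of accelerator compute, 5 ms of CPU fallback, 24 crossings of 4 MB each over a 4 GB/s link, and 0.2 ms overhead per crossing, the estimate is about 43.8 ms, which loses to a 40 ms CPU-only run. The model is only a screening tool; the measured partition trace is the evidence.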

Measure

Partition count, bytes transferred, boundary latency, provider placement, and fallback share.
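
These metrics can be derived from a recorded placement list. The sketch below assumes a linear chain in which each node consumes only the previous node's output; real graphs branch, so this undercounts or overcounts transfers there. `NodePlacement` and its fields are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NodePlacement:
    name: str
    provider: str          # where the node actually ran
    output_bytes: int      # size of the tensor handed to the next node
    fallback_reason: str = ""

def partition_metrics(nodes, accelerator):
    """Summarize partitions, boundary traffic, and fallback share for a chain."""
    partitions = 0
    bytes_transferred = 0
    prev = None
    for node in nodes:
        if prev is None or node.provider != prev.provider:
            partitions += 1
            if prev is not None:
                # The previous output crosses a provider boundary.
                bytes_transferred += prev.output_bytes
        prev = node
    fallback = [n for n in nodes if n.provider != accelerator]
    return {
        "partition_count": partitions,
        "bytes_transferred": bytes_transferred,
        "fallback_share": len(fallback) / len(nodes) if nodes else 0.0,
        "fallback_reasons": {n.name: n.fallback_reason for n in fallback},
    }
```

A single unsupported node in the middle of a chain yields three partitions and two boundary copies, which is why fallback share by node count can look small while its latency cost is large.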

Fleet compatibility

Hardware capability depends on firmware, driver, compiler, runtime, and artifact versions.

Implementation

Maintain a certified compatibility tuple and rollout rings; attach device identity and software versions to traces and benchmarks.
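
A certified compatibility tuple can be represented as an immutable record checked against an allowlist. This is a minimal sketch: the field set mirrors the chain named above, and every version string is a made-up placeholder.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class CompatTuple:
    firmware: str
    driver: str
    compiler: str
    runtime: str
    artifact: str

# Hypothetical certified combinations; in practice these come from test results.
CERTIFIED = frozenset({
    CompatTuple("fw-1.4", "drv-550", "cc-3.2", "rt-2.1", "model-a-v7"),
})

def is_certified(observed, certified=CERTIFIED):
    """True only if the exact combination was tested together."""
    return observed in certified

def skewed_fields(observed, reference):
    """Name the components that differ from a certified reference tuple."""
    return [f.name for f in fields(CompatTuple)
            if getattr(observed, f.name) != getattr(reference, f.name)]
```

Matching on the whole tuple rather than on per-component minimum versions reflects the point that components are certified as a set; `skewed_fields` gives the value to attach to a trace or alert when a device drifts.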

Operational implications

Independent upgrades can change numerics, available kernels, memory use, or ABI behavior.

Measure

Compatibility-test pass rate, driver/runtime skew, model-load failures, and rollback time.

Reference tables

Hardware target comparison
Target	Typical strength	Primary constraint	Runtime implication
CPU	Portability, control flow, large system memory	Lower dense tensor throughput	Vectorized kernels, thread pools, NUMA awareness
GPU	Parallel matrix and attention workloads	HBM capacity, bandwidth, power	Batching, fusion, device-local caches
TPU / matrix ASIC	Efficient tensor programs	Compiler/platform coupling	Compiler-centered deployment and sharding
NPU / DSP	Low-power on-device inference	Restricted operator and precision set	Delegation, quantization, CPU fallback
FPGA	Deterministic custom pipelines	Toolchain and design complexity	AOT artifacts and fixed interfaces
Browser accelerator	Client-local compute	API and sandbox limits	Progressive enhancement and fallback

Decision checklist

Is the workload compute-, memory-, transfer-, latency-, power-, or capacity-bound?
What weight, activation, and KV-cache capacity must remain resident?
Which precision modes are supported end to end and quality-tested?
What fallback path exists for unsupported operations?
Which topology and interconnect assumptions are required?
Can the fleet maintain a tested driver/runtime/artifact matrix?
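
For the residency question in the checklist, KV-cache size for a standard transformer can be estimated from the model shape: one key and one value vector per layer, per KV head, per token. The sketch below assumes a dense cache with no paging, compression, or sharing; the model dimensions in the usage note are hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    """Bytes for keys plus values across all layers (dense cache assumed)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

def resident_bytes(n_params, bytes_per_param, kv_bytes, activation_bytes):
    """Weights plus KV cache plus an activation budget supplied by the caller."""
    return n_params * bytes_per_param + kv_bytes + activation_bytes
```

With 32 layers, 8 KV heads, a head dimension of 128, a 4096-token context, batch 1, and 2-byte elements, the cache is 536,870,912 bytes (512 MiB). It scales linearly with both batch and context length, so concurrency targets belong in the capacity estimate alongside the weights.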

Common mistakes

Selecting hardware by peak FLOPS without modeling memory and communication.
Assuming delegate or execution-provider handoff is free.
Publishing a precision mode without reporting quality and kernel support.
Ignoring sustained thermal throttling on edge hardware.
Changing drivers independently of compiled artifacts and runtime validation.

Sources and further reading

CUDA C++ Programming Guide
NVIDIA · Official documentation · accessed 2026-06-21 UTC

Execution Providers
ONNX Runtime · Official documentation · accessed 2026-06-21 UTC

Web Neural Network API
W3C · Standard · accessed 2026-06-21 UTC

Core ML
Apple · Official documentation · accessed 2026-06-21 UTC

ExecuTorch backend delegation
PyTorch · Official documentation · accessed 2026-06-21 UTC

Maintenance record

Last reviewed: 2026-06-21 UTC
