Key takeaways
- Peak arithmetic throughput is only one limit; memory bandwidth, capacity, on-chip storage, interconnect, power, and thermals often dominate.
- CPUs offer the broadest reach and the strongest control-flow execution, GPUs offer parallel throughput, and NPUs and custom accelerators offer energy efficiency for the operations they support.
- Heterogeneous execution introduces partition boundaries and transfer costs that must be measured.
- Precision support is useful only when the model, runtime, and kernels implement it end to end.
- Firmware, driver, compiler, runtime, libraries, and model artifact form one compatibility chain that must be tested together.
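The first takeaway can be made concrete with the roofline model, which bounds attainable throughput by the lower of peak compute and memory bandwidth times arithmetic intensity. The sketch below is illustrative only: the function names and the device figures (100 TFLOP/s, 1 TB/s) are assumptions, not values from any real part.

```python
def attainable_flops(peak_flops: float, mem_bandwidth: float, intensity: float) -> float:
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity).

    intensity is FLOPs performed per byte moved to or from memory.
    """
    return min(peak_flops, mem_bandwidth * intensity)


def bound_by(peak_flops: float, mem_bandwidth: float, intensity: float) -> str:
    """Classify a kernel as memory- or compute-bound under the roofline model."""
    ridge = peak_flops / mem_bandwidth  # FLOPs per byte where the two limits meet
    return "memory" if intensity < ridge else "compute"


# Hypothetical device figures, chosen only to make the arithmetic easy to follow.
PEAK_FLOPS = 100e12      # 100 TFLOP/s
MEM_BANDWIDTH = 1e12     # 1 TB/s

for intensity in (2.0, 500.0):
    print(intensity,
          bound_by(PEAK_FLOPS, MEM_BANDWIDTH, intensity),
          attainable_flops(PEAK_FLOPS, MEM_BANDWIDTH, intensity))
```

A kernel at 2 FLOPs per byte on this hypothetical device can reach at most 2 TFLOP/s, so buying more peak compute would not help it. The model ignores on-chip storage, interconnect, power, and thermals, which need their own measurements.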
Runtime boundary
A useful architecture identifies what the hardware layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.
Receives
Compiled kernels or graphs, tensor buffers, precision requirements, placement decisions, and device capability information.
Owns
Physical compute, memory hierarchy, device allocation, supported instructions and precision, and communication limits.
Emits
Executed operations, device events, memory transfers, utilization counters, and hardware-specific errors.
Does not own
Model semantics, request authorization, benchmark comparability, or application workflow.
Failure modes
Out-of-memory, unsupported operators, driver incompatibility, transfer bottlenecks, thermal throttling, topology mismatch, and underutilization.
Evidence and metrics
Device memory, memory bandwidth, transfer time, utilization, occupancy, power, thermals, queue depth, and error counters.
CPU targets
CPUs provide broad availability, large system memory, mature debugging, and strong scalar/control-flow execution.
Implementation
Use vectorized libraries, thread pools, NUMA-aware placement, pinned memory where appropriate, and model formats that support memory mapping.
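As a rough illustration of why memory-mappable formats matter on CPU, the sketch below maps a flat float32 file read-only and decodes only a small slice of it. The file layout and the helper names `write_weights` and `read_slice` are assumptions made for this example; real model formats carry headers and metadata that are omitted here.

```python
import mmap
import os
import struct
import tempfile

FLOAT32_BYTES = 4


def write_weights(path, values):
    """Write values as little-endian float32 (a stand-in for a real weight file)."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}f", *values))


def read_slice(path, start, count):
    """Return `count` float32 values starting at element index `start`.

    The file is mapped read-only, so the OS pages data in on demand
    instead of the process reading the whole file up front.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return list(struct.unpack_from(f"<{count}f", mm, start * FLOAT32_BYTES))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weights.bin")
    write_weights(path, [i * 0.5 for i in range(1000)])
    print(read_slice(path, 10, 3))
```

Read-only mappings can also be shared between processes on the same host, which is one reason mapped formats suit multi-worker CPU serving.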
Operational implications
CPUs are effective for small models, preprocessing, sparse/control-heavy tasks, local inference, and fallback. On multi-socket hosts, cross-socket memory access can become the bottleneck.
Measure
Socket/core utilization, memory bandwidth, NUMA misses, vector instruction use, latency, and power.
GPU targets
GPUs provide parallel matrix, attention, and media compute with high-bandwidth device memory.
Implementation
Keep weights and active state resident, batch compatible work, use fused kernels, and minimize host-device transfers.
Operational implications
GPU capacity is constrained by HBM/VRAM, power, cooling, and availability. Decode is often memory-bandwidth-bound, so low compute utilization does not by itself indicate spare capacity.
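A back-of-envelope bound shows why decode tends to be limited by bandwidth. The sketch assumes every decode step reads all weights once plus each sequence's KV cache, and that nothing else touches device memory. That is a deliberate simplification, and the function name and figures are hypothetical.

```python
def decode_tokens_per_second_bound(weight_bytes: float,
                                   kv_bytes_per_sequence: float,
                                   mem_bandwidth: float,
                                   batch_size: int = 1) -> float:
    """Upper bound on decode throughput when memory traffic is the only limit.

    Each step reads the weights once (shared across the batch) and the
    KV cache of every sequence in the batch, then emits one token per sequence.
    """
    bytes_per_step = weight_bytes + kv_bytes_per_sequence * batch_size
    steps_per_second = mem_bandwidth / bytes_per_step
    return steps_per_second * batch_size


# Hypothetical: 10 GB of weights, 1 TB/s device bandwidth, KV traffic ignored.
print(decode_tokens_per_second_bound(10e9, 0.0, 1e12, batch_size=1))
print(decode_tokens_per_second_bound(10e9, 0.0, 1e12, batch_size=4))
```

Under these assumptions the bound scales with batch size because the weight read is amortized across sequences, which is the usual argument for batching compatible work. Measured throughput will sit below the bound.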
Measure
HBM used/free, bandwidth, SM utilization, kernel occupancy, transfer time, power, and throttling.
TPUs and matrix accelerators
Matrix-oriented accelerators can deliver high efficiency when compiler and model operations map well to their execution model.
Implementation
Use the supported compiler/runtime stack, static or bounded shapes, approved precision, and topology-aware sharding.
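One common way to keep shapes bounded is to pad variable-length inputs up to a small set of bucket sizes, so the compiler sees only a few distinct shapes. The sketch below shows the idea in plain Python. The bucket sizes, the pad id, and the function names are assumptions, and no particular compiler API is implied.

```python
import bisect

# Hypothetical bucket sizes; real values depend on the workload's length distribution.
BUCKETS = (128, 256, 512, 1024)


def pick_bucket(length: int, buckets=BUCKETS) -> int:
    """Return the smallest bucket that fits `length`, or raise if none does."""
    i = bisect.bisect_left(buckets, length)
    if i == len(buckets):
        raise ValueError(f"length {length} exceeds largest bucket {buckets[-1]}")
    return buckets[i]


def pad_to_bucket(tokens: list, pad_id: int = 0, buckets=BUCKETS) -> list:
    """Pad a token list to its bucket size so compiled shapes stay bounded."""
    target = pick_bucket(len(tokens), buckets)
    return tokens + [pad_id] * (target - len(tokens))


padded = pad_to_bucket(list(range(1, 131)))
print(len(padded))
```

The trade-off is wasted compute on padding versus fewer compilations. Rejecting over-length inputs explicitly, as here, is one possible policy; truncation or chunking are others.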
Operational implications
Portability is lower and compilation/runtime boundaries are central. Evaluate full workload support, not isolated matrix throughput.
Measure
Compile time, device utilization, host/device input time, collective time, and goodput.
NPUs and DSPs
Mobile and edge NPUs optimize low-power inference for supported operator sets and data types.
Implementation
Use delegates or execution providers, quantized artifacts, capability discovery, and explicit CPU/GPU fallback policy.
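Capability discovery and an explicit fallback policy can be sketched as a placement pass over the graph. Everything here is hypothetical: the `Backend` type, the operator names, and the supported sets are invented for illustration and do not correspond to any real delegate or execution-provider API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    supported_ops: frozenset
    supported_dtypes: frozenset


def place(op: str, dtype: str, preference: list) -> str:
    """Return the first backend, in preference order, that supports op and dtype."""
    for backend in preference:
        if op in backend.supported_ops and dtype in backend.supported_dtypes:
            return backend.name
    # Failing loudly is the policy here: no silent fallback to an untested path.
    raise LookupError(f"no backend supports {op} with {dtype}")


def plan(graph: list, preference: list) -> list:
    """Map each (op, dtype) node to a backend name."""
    return [(op, place(op, dtype, preference)) for op, dtype in graph]


# Hypothetical capability sets, as a discovery step might report them.
NPU = Backend("npu", frozenset({"conv2d", "matmul", "relu"}), frozenset({"int8"}))
CPU = Backend("cpu", frozenset({"conv2d", "matmul", "relu", "softmax", "topk"}),
              frozenset({"int8", "fp32"}))

graph = [("conv2d", "int8"), ("relu", "int8"), ("topk", "fp32"), ("matmul", "int8")]
print(plan(graph, [NPU, CPU]))
```

The preference list is the fallback policy written down as data, which makes it reviewable and testable rather than an implicit runtime default.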
Operational implications
Operator coverage and layout conversions can fragment graphs. Sustained thermal behavior matters more than short bursts.
Measure
Delegate coverage, fallback operations, transfer time, sustained latency, energy, and thermals.
FPGAs and custom accelerators
FPGAs can implement deterministic pipelines or custom dataflows, while fixed ASICs optimize narrower operation families.
Implementation
Use AOT compilation, static interfaces, bounded shapes, and vendor toolchain provenance.
Operational implications
They suit stable, latency- or energy-critical workloads where development and portability cost are justified.
Measure
End-to-end latency, initiation interval, resource utilization, power, compilation time, and supported model coverage.
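Initiation interval relates directly to latency and throughput in a pipelined design. The sketch uses the common idealized model in which the first item takes `depth` cycles and each further item completes `ii` cycles later. Stalls, memory contention, and off-by-one differences between toolchain conventions are ignored, and the numbers are hypothetical.

```python
def pipeline_cycles(depth: int, ii: int, items: int) -> int:
    """Total cycles to process `items` inputs through an ideal pipeline.

    depth: cycles for one item to traverse the pipeline
    ii:    initiation interval, cycles between accepting successive inputs
    """
    if items < 1:
        raise ValueError("items must be at least 1")
    return depth + (items - 1) * ii


def pipeline_throughput(clock_hz: float, ii: int) -> float:
    """Steady-state items per second for an ideal pipeline."""
    return clock_hz / ii


# Hypothetical design: 20-stage pipeline, II of 2, 200 MHz clock, 100 inputs.
cycles = pipeline_cycles(depth=20, ii=2, items=100)
print(cycles, cycles / 200e6, pipeline_throughput(200e6, ii=2))
```

Halving the initiation interval roughly doubles steady-state throughput while leaving single-item latency unchanged, which is why the two are worth reporting separately.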
Heterogeneous partitioning
One model may be split among CPU, GPU, NPU, or specialized libraries.
Implementation
Record the final partition map, tensor layouts, copies, synchronization, and fallback reason.
Operational implications
A nominally accelerated graph can be slower than CPU-only execution if unsupported nodes cause repeated transfers.
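A simple cost model makes the break-even visible. It assumes transfers are not overlapped with compute and that every boundary crossing pays a fixed overhead plus a bandwidth-proportional copy. The function name and all figures are illustrative assumptions.

```python
def hybrid_latency(accel_compute_s: float,
                   fallback_compute_s: float,
                   crossings: int,
                   bytes_per_crossing: float,
                   link_bytes_per_s: float,
                   crossing_overhead_s: float) -> float:
    """Estimated latency of a partitioned graph with serialized transfers."""
    per_crossing = crossing_overhead_s + bytes_per_crossing / link_bytes_per_s
    return accel_compute_s + fallback_compute_s + crossings * per_crossing


CPU_ONLY_S = 0.020  # hypothetical CPU-only latency for the same graph

# Same compute split, different number of device boundary crossings.
for crossings in (2, 12):
    t = hybrid_latency(0.004, 0.003, crossings, 4e6, 4e9, 0.0005)
    print(crossings, round(t, 4), "faster" if t < CPU_ONLY_S else "slower")
```

With these numbers, 2 crossings leave the partitioned graph well ahead of CPU-only execution, while 12 crossings make it slower even though the accelerated portion itself runs in 4 ms.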
Measure
Partition count, bytes transferred, boundary latency, provider placement, and fallback share.
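These metrics can be derived from a recorded placement. The sketch assumes a linear chain of nodes, each tagged with the provider that ran it and the size of its output tensor. Real graphs branch, so treat this as a simplification; the record format and names are hypothetical.

```python
from itertools import groupby


def partition_metrics(nodes: list, primary: str) -> dict:
    """Summarize a linear execution trace.

    nodes:   list of (op_name, provider, output_bytes) in execution order
    primary: the provider the graph was meant to run on
    """
    if not nodes:
        raise ValueError("nodes must not be empty")
    # Consecutive nodes on the same provider form one partition.
    partitions = [list(group) for _, group in groupby(nodes, key=lambda n: n[1])]
    # The last node of every partition except the final one crosses a boundary.
    bytes_transferred = sum(part[-1][2] for part in partitions[:-1])
    fallback_nodes = sum(1 for n in nodes if n[1] != primary)
    return {
        "partition_count": len(partitions),
        "boundaries": len(partitions) - 1,
        "bytes_transferred": bytes_transferred,
        "fallback_share": fallback_nodes / len(nodes),
    }


trace = [
    ("conv", "npu", 1000),
    ("relu", "npu", 1000),
    ("topk", "cpu", 40),
    ("matmul", "npu", 400),
    ("softmax", "npu", 400),
]
print(partition_metrics(trace, primary="npu"))
```

A single unsupported node in the middle of the chain produces three partitions and two boundaries, which is how a small fallback share can still dominate transfer cost.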
Fleet compatibility
Hardware capability depends on firmware, driver, compiler, runtime, and artifact versions.
Implementation
Maintain a certified compatibility tuple and rollout rings; attach device identity and software versions to traces and benchmarks.
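A certified compatibility tuple can be represented as an immutable record that is checked before model load and attached to traces. The field names follow the chain described above, but the `StackVersion` type, the version strings, and the helper functions are all hypothetical.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class StackVersion:
    firmware: str
    driver: str
    compiler: str
    runtime: str
    artifact: str


def is_certified(stack: StackVersion, certified: tuple) -> bool:
    """True only if the exact tuple was tested together."""
    return stack in certified


def skew_fields(stack: StackVersion, certified: tuple) -> list:
    """Fields that differ from the closest certified tuple (fewest mismatches)."""
    observed = asdict(stack)
    best = None
    for candidate in certified:
        diff = sorted(k for k, v in asdict(candidate).items() if observed[k] != v)
        if best is None or len(diff) < len(best):
            best = diff
    return best if best is not None else sorted(observed)


# Hypothetical certified set and one device that upgraded its driver independently.
CERTIFIED = (
    StackVersion("fw-1.2", "drv-550", "cc-3.1", "rt-2.0", "model-a-v7"),
    StackVersion("fw-1.3", "drv-560", "cc-3.2", "rt-2.1", "model-a-v8"),
)
device = StackVersion("fw-1.2", "drv-560", "cc-3.1", "rt-2.0", "model-a-v7")

print(is_certified(device, CERTIFIED), skew_fields(device, CERTIFIED))
```

Matching on the whole tuple rather than per-component minimum versions reflects the point that components are validated together, not individually.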
Operational implications
Independent upgrades can change numerics, available kernels, memory use, or ABI behavior.
Measure
Compatibility-test pass rate, driver/runtime skew, model-load failures, and rollback time.
Reference tables
| Target | Typical strength | Primary constraint | Runtime implication |
|---|---|---|---|
| CPU | Portability, control flow, large system memory | Lower dense tensor throughput | Vectorized kernels, thread pools, NUMA awareness |
| GPU | Parallel matrix and attention workloads | HBM capacity, bandwidth, power | Batching, fusion, device-local caches |
| TPU / matrix ASIC | Efficient tensor programs | Compiler/platform coupling | Compiler-centered deployment and sharding |
| NPU / DSP | Low-power on-device inference | Restricted operator and precision set | Delegation, quantization, CPU fallback |
| FPGA | Deterministic custom pipelines | Toolchain and design complexity | AOT artifacts and fixed interfaces |
| Browser accelerator | Client-local compute | API and sandbox limits | Progressive enhancement and fallback |
Decision checklist
- Is the workload compute-, memory-, transfer-, latency-, power-, or capacity-bound?
- What weight, activation, and KV-cache capacity must remain resident?
- Which precision modes are supported end to end and quality-tested?
- What fallback path exists for unsupported operations?
- Which topology and interconnect assumptions are required?
- Can the fleet maintain a tested driver/runtime/artifact matrix?
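The residency question in the checklist can be answered with simple arithmetic. The sketch uses the usual KV-cache size estimate for a transformer decoder (keys and values, per layer, per KV head, per position) and checks the total against device memory with some headroom. The model dimensions, the activation estimate, and the 10% headroom are hypothetical.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int) -> int:
    """KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * seq_len * batch * element size."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


def fits(weight_bytes: int, kv_bytes: int, activation_bytes: int,
         device_bytes: int, headroom: float = 0.10) -> bool:
    """True if resident state fits while leaving `headroom` of device memory free."""
    usable = device_bytes * (1.0 - headroom)
    return weight_bytes + kv_bytes + activation_bytes <= usable


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 16-bit cache.
kv = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                    seq_len=4096, batch=1, bytes_per_elem=2)
print(kv / 1024**2, "MiB per sequence")

GIB = 1024**3
print(fits(14 * GIB, kv * 8, 1 * GIB, 24 * GIB))   # batch of 8 sequences
print(fits(14 * GIB, kv * 32, 1 * GIB, 24 * GIB))  # batch of 32 sequences
```

Because the cache grows linearly with both sequence length and batch size, the same device can pass this check at one concurrency level and fail it at another.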
Common mistakes
- Selecting hardware by peak FLOPS without modeling memory and communication.
- Assuming delegate or execution-provider handoff is free.
- Publishing a precision mode without reporting quality and kernel support.
- Ignoring sustained thermal throttling on edge hardware.
- Changing drivers independently of compiled artifacts and runtime validation.
Sources and further reading
- CUDA C++ Programming Guide
- Execution Providers
- Web Neural Network API
- Core ML
- ExecuTorch backend delegation
Last reviewed: 2026-06-21 UTC
