Key takeaways
- Every optimization should target a measured bottleneck; otherwise it can move cost into another layer.
- Fusion and tiling reduce memory traffic but can increase register pressure, code size, and compilation complexity.
- Quantization changes capacity and quality; report exact format, calibration, kernel path, and task evaluation.
- Serving techniques such as batching, prefix reuse, cache offload, and speculative decoding are runtime optimizations even when the model graph is unchanged.
- Judge results by workload-level Goodput or cost under quality and SLO constraints, not by isolated kernel speed.
Runtime boundary
A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.
Receives
A measured workload, model/graph, target hardware, quality constraints, and baseline telemetry.
Owns
Optimization legality, target-specific configuration, quality validation hooks, and rollback comparison.
Emits
Rewritten graphs, selected kernels, compressed weights, schedules, cache policy, serving configuration, and benchmark evidence.
Does not own
Business permission to trade quality for speed without review.
Failure modes
Quality regression, numerical instability, register spilling, memory blow-up, worse tail latency, cache thrashing, and hardware lock-in.
Evidence and metrics
Latency distribution, throughput, Goodput, bytes moved, memory use, cache hit rate, compile time, power, cost, and task quality.
Constant folding, canonicalization, and dead-code elimination
Graph passes remove work or normalize equivalent patterns before target lowering.
Implementation
Use verified rewrite rules and preserve a source-to-optimized mapping so that failures can be traced back to the original graph.
Operational implications
These passes are usually low risk but can expose bugs in shape inference or custom operations.
Measure
Nodes removed, constants materialized, compile time, and parity.
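As a minimal sketch of these passes, the toy graph below folds a constant subexpression, removes an unreachable node, and checks parity against the original graph. The dictionary-based graph format and the names `fold_constants`, `eliminate_dead_code`, and `evaluate` are assumptions made for illustration; they do not correspond to any particular compiler's API.

```python
# Toy graph (hypothetical format): name -> (op, payload).
# "const" nodes carry a value, "input" nodes carry None, others carry input names.
def fold_constants(graph):
    """Replace add/mul nodes whose inputs are all constants with a const node."""
    folded = dict(graph)
    changed = True
    while changed:
        changed = False
        for name, (op, args) in list(folded.items()):
            if op in ("add", "mul") and all(folded[a][0] == "const" for a in args):
                x, y = (folded[a][1] for a in args)
                folded[name] = ("const", x + y if op == "add" else x * y)
                changed = True
    return folded

def eliminate_dead_code(graph, outputs):
    """Keep only nodes reachable from the requested outputs."""
    live, stack = set(), list(outputs)
    while stack:
        name = stack.pop()
        if name in live:
            continue
        live.add(name)
        op, args = graph[name]
        if op not in ("const", "input"):
            stack.extend(args)
    return {n: v for n, v in graph.items() if n in live}

def evaluate(graph, name, feeds):
    """Reference interpreter used for the parity check."""
    op, args = graph[name]
    if op == "input":
        return feeds[name]
    if op == "const":
        return args
    a, b = (evaluate(graph, n, feeds) for n in args)
    return a + b if op == "add" else a * b

graph = {
    "x": ("input", None),
    "two": ("const", 2.0),
    "three": ("const", 3.0),
    "six": ("mul", ("two", "three")),   # foldable: both inputs are constants
    "y": ("mul", ("x", "six")),
    "unused": ("add", ("x", "two")),    # dead: not reachable from the output
}
optimized = eliminate_dead_code(fold_constants(graph), outputs=["y"])
nodes_removed = len(graph) - len(optimized)
parity = evaluate(graph, "y", {"x": 4.0}) == evaluate(optimized, "y", {"x": 4.0})
```

Here `nodes_removed` and `parity` stand in for two of the measures listed above; a real pass would also record which source nodes each optimized node came from.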
Operator and kernel fusion
Fusion keeps intermediates on chip, reduces memory traffic, and reduces launch overhead.
Implementation
Fuse compatible producer-consumer regions while accounting for layouts, resource use, and backend support.
Operational implications
Oversized kernels can spill registers or lower occupancy. Evaluate the complete subgraph, not a theoretical launch count.
Measure
Kernel count, bytes moved, register spills, occupancy, and end-to-end latency.
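The memory-traffic argument for fusion can be sketched with a deliberately simple model: a chain of unary elementwise operations, where each unfused operation reads and writes the full tensor once. The model ignores caches, register pressure, and launch overhead, and the function names are hypothetical, so treat it as a first-order estimate rather than a prediction.

```python
def chain_traffic_bytes(num_elements, bytes_per_element, num_ops, fused):
    """Estimated off-chip bytes for a chain of unary elementwise ops (toy model)."""
    tensor_bytes = num_elements * bytes_per_element
    passes = 1 if fused else num_ops      # fused: one read and one write in total
    return 2 * tensor_bytes * passes

def run_unfused(xs, ops):
    """Each op materializes a full intermediate list."""
    for op in ops:
        xs = [op(x) for x in xs]
    return xs

def run_fused(xs, ops):
    """All ops are applied per element; no intermediate list is materialized."""
    out = []
    for x in xs:
        for op in ops:
            x = op(x)
        out.append(x)
    return out

ops = [lambda x: x * 2.0, lambda x: x + 1.0, lambda x: max(x, 0.0)]
data = [-3.0, -0.5, 0.0, 1.5]
unfused_bytes = chain_traffic_bytes(1 << 20, 2, len(ops), fused=False)
fused_bytes = chain_traffic_bytes(1 << 20, 2, len(ops), fused=True)
```

Under these assumptions the fused chain moves one third of the bytes for three operations. Whether that turns into lower end-to-end latency still depends on the spills and occupancy effects noted above, which this model does not capture.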
Tiling, vectorization, and double buffering
Tiling fits working sets into cache or shared memory; vectorization maps work to SIMD/tensor units; double buffering overlaps transfer and compute.
Implementation
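Tune tile sizes and layouts to hardware, shapes, and precision. Persist the selected schedules so that results can be reproduced.
Operational implications
A schedule optimized for one shape can regress another. Budget autotuning time, and manage the tuning cache so that entries are tied to the shape, precision, and hardware they were tuned for.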
Measure
Cache hit rate, shared-memory use, occupancy, bandwidth, kernel time, and tuning time.
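The loop structure behind tiling can be illustrated with a plain-Python matrix multiply. This is only a sketch of the blocking pattern: pure Python gains nothing from cache locality, and the `tile` parameter and function names are illustrative assumptions. The point is that the tiled loop nest computes the same result for any tile size, which is what makes the tile size a tunable schedule parameter.

```python
def matmul_naive(a, b):
    """Reference triple loop over an n x k and a k x m matrix."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def matmul_tiled(a, b, tile):
    """Same computation, iterated block by block so each block's working set is small."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Inner loops touch at most tile x tile elements of each operand.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c

a = [[i * 5 + j for j in range(5)] for i in range(4)]   # 4 x 5, integers for exact equality
b = [[i - j for j in range(3)] for i in range(5)]       # 5 x 3
```

The `min(...)` bounds handle shapes that are not multiples of the tile size, which is one place where a schedule tuned for one shape can behave differently on another.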
IO-aware attention kernels
FlashAttention-style kernels reduce materialization and high-bandwidth-memory traffic for attention.
Implementation
Verify supported masks, sequence lengths, head dimensions, precision, and hardware generation.
Operational implications
Specialized kernels may fall back for unsupported features; the route must remain visible.
Measure
Attention time, HBM traffic, kernel selection, fallback rate, and memory.
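The reason such kernels can avoid materializing the full score matrix is the online-softmax recurrence: a running maximum and denominator let the output be accumulated one key/value block at a time. The sketch below shows that recurrence for a single query in plain Python. It is not FlashAttention itself (no masking, no batching, no on-chip tiling), and the function names are hypothetical.

```python
import math

def attention_reference(q, keys, values):
    """Materialize all scores, apply softmax, then weight the values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def attention_streaming(q, keys, values, block_size):
    """Process keys/values block by block with a running max and denominator."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max, denom, acc = float("-inf"), 0.0, [0.0] * dim
    for start in range(0, len(keys), block_size):
        block_k = keys[start:start + block_size]
        block_v = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in block_k]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)   # 0.0 on the first block
        weights = [math.exp(s - new_max) for s in scores]
        denom = denom * rescale + sum(weights)
        acc = [acc[d] * rescale + sum(w * v[d] for w, v in zip(weights, block_v))
               for d in range(dim)]
        running_max = new_max
    return [x / denom for x in acc]

q = [1.0, 0.5]
keys = [[0.1, 0.2], [1.0, -1.0], [0.3, 0.9], [2.0, 0.1], [-0.5, 0.4]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
```

Only one block of scores exists at a time in `attention_streaming`, which is the property that reduces high-bandwidth-memory traffic in a real kernel.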
Quantization and mixed precision
Lower precision reduces stored bytes and may increase tensor throughput. Mixed precision preserves sensitive operations or accumulators.
Implementation
Name weight, activation, and KV precision separately; record scaling, group size, calibration, outlier treatment, and kernels.
Operational implications
Quality and numerical stability must be tested on representative tasks. A nominal bit width alone does not make a result reproducible; the full recipe does.
Measure
Artifact size, memory, throughput, latency, task quality, and numerical errors.
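To make the "record the exact format" advice concrete, here is a minimal sketch of symmetric per-group weight quantization with one scale per group, plus a recipe record. The max-abs scaling rule, the function names, and every field name in `recipe` are assumptions chosen for illustration; production schemes differ in calibration, zero points, and outlier handling.

```python
def quantize_groups(weights, group_size, bits=8):
    """Symmetric per-group quantization: one max-abs scale per group (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-qmax, min(qmax, round(w / scale))) for w in group)
    return codes, scales

def dequantize_groups(codes, scales, group_size):
    """Map integer codes back to floats using the owning group's scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

weights = [0.3, -1.0, 0.25, 0.0, 8.0, -3.0, 2.0, 1.0]
group_size = 4
codes, scales = quantize_groups(weights, group_size)
restored = dequantize_groups(codes, scales, group_size)
max_abs_error = max(abs(w - r) for w, r in zip(weights, restored))

# Hypothetical recipe record: names weight, activation, and KV precision separately.
recipe = {
    "weight_format": "int8 symmetric",
    "activation_format": "fp16",
    "kv_cache_format": "fp16",
    "group_size": group_size,
    "calibration": "max-abs over weights, no activation calibration",
}
```

The second group has a much larger scale than the first, so the same bit width yields different absolute error per group. That is one reason a bit width without group size and scaling rule does not identify the artifact.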
Pruning, sparsity, and codebooks
Sparsity removes weights or constrains them to a structured pattern; codebooks (palettization) store low-bit indices into a small table of representative values.
Implementation
Use hardware-supported sparsity patterns and kernels that consume the compressed representation directly.
Operational implications
Compression can save storage without speed if the runtime expands data or lacks sparse kernels.
Measure
Effective density, decode overhead, kernel support, memory traffic, and quality.
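A codebook representation can be sketched as follows. The codebook here is simply evenly spaced between the minimum and maximum weight, which is an assumption made to keep the example short; practical tools typically fit the codebook to the weight distribution. The function names and the 16-bit codebook entry width are likewise illustrative.

```python
def palettize(weights, bits):
    """Assign each weight the index of its nearest codebook entry (uniform codebook assumed)."""
    levels = 2 ** bits
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / (levels - 1) if hi > lo else 0.0
    codebook = [lo + i * step for i in range(levels)]
    indices = [min(range(levels), key=lambda i: abs(codebook[i] - w)) for w in weights]
    return indices, codebook

def compressed_bits(num_weights, bits, codebook_value_bits=16):
    """Storage for the indices plus the codebook itself."""
    return num_weights * bits + (2 ** bits) * codebook_value_bits

weights = [-1.0, -0.9, 0.05, 0.4, 1.0, 0.95]
indices, codebook = palettize(weights, bits=2)
restored = [codebook[i] for i in indices]      # the expansion a runtime may need to do

dense_bits = 4096 * 16
packed_bits = compressed_bits(4096, bits=4)
```

The `restored` line is the step that matters operationally: if the runtime performs this expansion into a dense buffer before every matmul, the storage saving does not become a memory-traffic or speed saving.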
Static memory planning and buffer reuse
Known lifetimes allow buffers to be reused and allocations removed from the critical path.
Implementation
Model dynamic-shape ranges and backend-specific workspaces. Verify observed peak against the plan.
Operational implications
Overly static plans reduce flexibility and can fail under concurrency or rare shapes.
Measure
Peak bytes, allocation count, fragmentation, reuse ratio, and OOM events.
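Lifetime-based buffer reuse can be sketched with a greedy planner: tensors are visited in order of first use, and each one takes the smallest free buffer that is large enough. The tuple format, the step-based lifetimes, and the name `plan_buffers` are assumptions for illustration; the sketch ignores alignment, dynamic shapes, and concurrency, all of which a real planner must model.

```python
def plan_buffers(tensors):
    """tensors: list of (name, size_bytes, first_step, last_step), steps inclusive."""
    buffers = []        # each entry: [size_bytes, busy_until_step]
    assignment = {}
    for name, size, first, last in sorted(tensors, key=lambda t: t[2]):
        # A buffer is reusable once its previous tenant's last use is strictly earlier.
        candidates = [i for i, (bsize, busy_until) in enumerate(buffers)
                      if busy_until < first and bsize >= size]
        if candidates:
            idx = min(candidates, key=lambda i: buffers[i][0])   # best fit
            buffers[idx][1] = last
        else:
            buffers.append([size, last])
            idx = len(buffers) - 1
        assignment[name] = idx
    planned_bytes = sum(b[0] for b in buffers)
    return assignment, planned_bytes

tensors = [
    ("a", 100, 0, 1),
    ("b", 100, 1, 2),
    ("c", 100, 2, 3),
    ("d", 50, 3, 4),
]
assignment, planned_bytes = plan_buffers(tensors)
naive_bytes = sum(size for _, size, _, _ in tensors)
reuse_ratio = 1 - planned_bytes / naive_bytes
```

Comparing `planned_bytes` with the peak actually observed at runtime is the verification step described above; a mismatch usually means a workspace or shape range was missing from the plan.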
Batching, caching, and scheduling
Continuous batching, prefix reuse, paged KV allocation, and cache-aware routing improve utilization and avoid repeated work.
Implementation
Optimize queue budgets, fairness, cache scope, and eviction using production traffic.
Operational implications
Maximizing batch size or cache hit rate can hide poor TTFT, cross-tenant leakage, or cache thrashing.
Measure
Goodput, p95/p99 TTFT/TPOT, active tokens, cache hit rate, prefill avoided, and fairness.
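Goodput is used throughout this page as throughput that also satisfies the latency targets. One hedged way to compute it from per-request records is shown below. The record fields `ttft` and `tpot`, the SLO values, and the nearest-rank percentile are assumptions for illustration; your definition may also include quality gates or per-tenant targets.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile over a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def goodput(requests, window_seconds, ttft_slo, tpot_slo):
    """Completed requests per second that met both latency targets."""
    good = sum(1 for r in requests
               if r["ttft"] <= ttft_slo and r["tpot"] <= tpot_slo)
    return good / window_seconds

# Hypothetical measurements in seconds, collected over a 2-second window.
requests = [
    {"ttft": 0.2, "tpot": 0.03},
    {"ttft": 0.3, "tpot": 0.05},   # misses the TPOT target
    {"ttft": 0.4, "tpot": 0.04},
    {"ttft": 1.5, "tpot": 0.03},   # misses the TTFT target
]
window = 2.0
throughput = len(requests) / window
good = goodput(requests, window, ttft_slo=0.5, tpot_slo=0.045)
p95_ttft = percentile([r["ttft"] for r in requests], 95)
```

In this example raw throughput is twice the Goodput, which is the gap that a maximized batch size can hide.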
Speculative and parallel decoding
Draft predictions or multi-token heads reduce target decode steps when acceptance is high.
Implementation
Measure acceptance, draft/verify cost, memory, sampling constraints, and cache commit/rollback.
Operational implications
Speculation can regress saturated high-concurrency workloads or consume memory needed for more requests.
Measure
Acceptance, accepted tokens/pass, TPOT, memory, Goodput, and quality.
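The dependence on acceptance can be made explicit with a commonly used simplified model: if each draft token is accepted independently with probability `alpha` and the draft proposes `k` tokens, the expected number of tokens produced per target verification pass is `(1 - alpha**(k + 1)) / (1 - alpha)`. The sketch below evaluates that expression and a rough speedup estimate. The independence assumption, the relative draft cost `draft_cost`, and the function names are all assumptions; the model ignores memory pressure and batching effects.

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens per target pass under i.i.d. acceptance (simplified model)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def estimated_speedup(alpha, k, draft_cost):
    """Rough speedup versus plain decoding; draft_cost is draft step time / target step time."""
    return expected_tokens_per_pass(alpha, k) / (1 + k * draft_cost)

low_acceptance = estimated_speedup(alpha=0.3, k=4, draft_cost=0.1)
high_acceptance = estimated_speedup(alpha=0.8, k=4, draft_cost=0.1)
```

With these illustrative inputs the low-acceptance case is close to break-even while the high-acceptance case is well above it, which matches the qualitative claim above. Measured acceptance and Goodput on the real workload should override any such estimate.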
Optimization workflow
Optimization is a controlled experiment over a production-shaped baseline.
Implementation
Profile queue, host, transfer, kernel, cache, and quality; change bounded mechanisms; retain raw evidence and rollback.
Operational implications
Promote only when the selected workload metric improves without violating correctness, security, portability, or operational requirements.
Measure
Baseline delta, confidence/run spread, quality gates, regression count, and rollback readiness.
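A promotion gate along these lines can be sketched as a comparison of repeated runs. The rule used here, that the relative gain must exceed both the run-to-run spread and a minimum threshold, is a crude heuristic chosen for illustration and is not a statistical test. The function names, the threshold, and the latency samples are hypothetical.

```python
import statistics

def compare_runs(baseline, candidate, min_relative_gain=0.02):
    """Compare latency samples; positive relative_gain means the candidate is faster."""
    base_mean = statistics.mean(baseline)
    cand_mean = statistics.mean(candidate)
    gain = (base_mean - cand_mean) / base_mean
    spread = max(statistics.stdev(baseline), statistics.stdev(candidate)) / base_mean
    return {
        "relative_gain": gain,
        "relative_spread": spread,
        "exceeds_noise": gain > max(spread, min_relative_gain),
    }

def should_promote(comparison, quality_gates):
    """Promote only if the gain clears the noise floor and every quality gate passed."""
    return comparison["exceeds_noise"] and all(quality_gates.values())

baseline_ms = [100.0, 102.0, 98.0, 101.0, 99.0]
candidate_ms = [90.0, 91.0, 89.0, 92.0, 88.0]
comparison = compare_runs(baseline_ms, candidate_ms)
promote = should_promote(comparison, {"task_accuracy": True, "numerical_parity": True})
blocked = should_promote(comparison, {"task_accuracy": False, "numerical_parity": True})
```

Keeping the raw samples alongside the summary makes it possible to re-run the comparison later with a stricter rule, and to confirm what a rollback should restore.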
Reference tables
| Technique | Optimizes | Primary trade-off | Measure |
|---|---|---|---|
| Constant folding / DCE | Unnecessary graph work | Compile complexity | Nodes and kernel count |
| Fusion | Launches and memory traffic | Register pressure/code size | Bytes moved, spills, latency |
| Tiling/vectorization | Locality and utilization | Shape/hardware sensitivity | Occupancy, bandwidth, kernel time |
| Quantization | Memory, bandwidth, compute | Quality and kernel support | Quality, size, Goodput |
| Static memory planning | Allocation and peak memory | Less dynamic flexibility | Peak bytes and allocations |
| Continuous batching | Serving utilization | Queue/fairness complexity | Goodput and tail latency |
| Prefix/KV reuse | Repeated prefill | Retention/privacy/eviction | Hit rate and prefill avoided |
| Speculative decoding | Target decode steps | Draft memory and verification | Acceptance and TPOT |
Decision checklist
- What measured bottleneck does the optimization target?
- Which runtime layer owns it and what new assumptions are introduced?
- What correctness and quality envelope must remain unchanged?
- How will tail latency, memory, power, and cost be measured?
- What fallback or disable switch exists?
- Does the result hold across production shapes and concurrency?
Common mistakes
- Applying quantization without task-quality evaluation.
- Reporting an isolated kernel speedup as an application speedup.
- Fusing until register pressure erases memory savings.
- Maximizing batch size while ignoring TTFT and fairness.
- Enabling prefix caching without tenant and retention policy.
- Comparing different models, precision, or sequence distributions as though only the runtime changed.
Sources and further reading
- XLA architecture
- TensorRT quantized types
- vLLM documentation
- TVM tutorials
Last reviewed: 2026-06-21 UTC
