Optimization Techniques

Key takeaways

Every optimization should target a measured bottleneck; otherwise it can move cost into another layer.
Fusion and tiling reduce memory traffic but can increase register pressure, code size, and compilation complexity.
Quantization changes capacity and quality; report exact format, calibration, kernel path, and task evaluation.
Serving techniques such as batching, prefix reuse, cache offload, and speculative decoding are runtime optimizations even when the model graph is unchanged.
Judge results by workload-level Goodput or cost under quality and SLO constraints, not by isolated kernel speed.

Runtime boundary

A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.

Receives

A measured workload, model/graph, target hardware, quality constraints, and baseline telemetry.

Owns

Optimization legality, target-specific configuration, quality validation hooks, and rollback comparison.

Emits

Rewritten graphs, selected kernels, compressed weights, schedules, cache policy, serving configuration, and benchmark evidence.

Does not own

Business permission to trade quality for speed without review.

Failure modes

Quality regression, numerical instability, register spilling, memory blow-up, worse tail latency, cache thrashing, and hardware lock-in.

Evidence and metrics

Latency distribution, throughput, Goodput, bytes moved, memory footprint, cache hit rate, compile time, power, cost, and task quality.

Constant folding, canonicalization, and dead-code elimination

Graph passes remove work or normalize equivalent patterns before target lowering.

Implementation

Use verified rewrite rules and preserve source-to-optimized mapping for diagnosis.

Operational implications

These passes are usually low risk but can expose bugs in shape inference or custom operations.

Measure

Nodes removed, constants materialized, compile time, and parity.
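
As an illustration of the two passes, here is a minimal sketch over a hypothetical dictionary-based mini-IR. The node format, the `OPS` table, and the function names are assumptions made for this example and do not correspond to any particular compiler. It folds operations whose inputs are all constants, removes nodes that cannot reach the output, and keeps a small provenance map so an optimized node can be traced back to its source nodes.

```python
import operator

# Hypothetical mini-IR: name -> (op, payload).
# ("const", value), ("input", None), or (op_name, (input_names...)).
OPS = {"add": operator.add, "mul": operator.mul}

def fold_constants(graph):
    """Replace ops whose inputs are all constants with one const node."""
    folded = dict(graph)
    provenance = {}  # folded node -> source nodes it was derived from
    changed = True
    while changed:
        changed = False
        for name, (op, args) in list(folded.items()):
            if op in OPS and all(folded[a][0] == "const" for a in args):
                value = OPS[op](*(folded[a][1] for a in args))
                folded[name] = ("const", value)
                provenance[name] = [name, *args]
                changed = True
    return folded, provenance

def eliminate_dead_code(graph, outputs):
    """Keep only nodes reachable from the requested outputs."""
    live, stack = set(), list(outputs)
    while stack:
        name = stack.pop()
        if name in live:
            continue
        live.add(name)
        op, args = graph[name]
        if op in OPS:
            stack.extend(args)
    return {n: graph[n] for n in graph if n in live}

graph = {
    "x": ("input", None),
    "two": ("const", 2),
    "three": ("const", 3),
    "six": ("mul", ("two", "three")),  # foldable: both inputs are constants
    "y": ("add", ("x", "six")),        # depends on a runtime input
    "unused": ("add", ("x", "two")),   # never reaches the output
}
folded, provenance = fold_constants(graph)
optimized = eliminate_dead_code(folded, outputs=["y"])
nodes_removed = len(graph) - len(optimized)
```

Here `nodes_removed` is the kind of counter the Measure list refers to. A parity check would additionally evaluate the original and optimized graphs on the same inputs and compare the results.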

Operator and kernel fusion

Fusion keeps intermediates on chip, reduces memory traffic, and reduces launch overhead.

Implementation

Fuse compatible producer-consumer regions while accounting for layouts, resource use, and backend support.

Operational implications

Oversized kernels can spill registers or lower occupancy. Evaluate the complete subgraph, not a theoretical launch count.

Measure

Kernel count, bytes moved, register spills, occupancy, and end-to-end latency.
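
The memory-traffic argument can be made concrete with a first-order estimate. The sketch below assumes a chain of unary elementwise operations in which every unfused kernel reads its input and writes its output exactly once, and a fused kernel keeps intermediates in registers. The function names, the register estimate, and the budget are hypothetical values chosen for illustration; real decisions should rest on measured spills, occupancy, and end-to-end latency.

```python
def bytes_moved(num_elements, bytes_per_element, num_ops, fused):
    """Estimate global-memory traffic for a chain of unary elementwise ops.

    Assumption: one read and one write of the full tensor per kernel,
    with no cache reuse between separate kernels.
    """
    tensor_bytes = num_elements * bytes_per_element
    kernels = 1 if fused else num_ops
    return 2 * tensor_bytes * kernels

def should_fuse(unfused_bytes, fused_bytes, estimated_registers, register_budget):
    """Screening rule only: fuse if traffic drops and registers fit the budget."""
    return fused_bytes < unfused_bytes and estimated_registers <= register_budget

# Three elementwise ops over one million 2-byte elements (made-up workload).
unfused = bytes_moved(1_000_000, 2, num_ops=3, fused=False)
fused = bytes_moved(1_000_000, 2, num_ops=3, fused=True)
traffic_ratio = unfused / fused

# Made-up register numbers: one candidate fits, one would likely spill.
fits = should_fuse(unfused, fused, estimated_registers=96, register_budget=128)
spills = should_fuse(unfused, fused, estimated_registers=200, register_budget=128)
```

Under these assumptions the fused kernel moves one third of the bytes, yet the second candidate is still rejected because its register estimate exceeds the budget.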

Tiling, vectorization, and double buffering

Tiling fits working sets into cache or shared memory; vectorization maps work to SIMD/tensor units; double buffering overlaps transfer and compute.

Implementation

Tune tile sizes and layouts to hardware, shapes, and precision. Persist the selected schedules so that results can be reproduced and compared.

Operational implications

A schedule optimized for one shape can regress another. Autotuning time needs a budget, and the cache of tuned schedules needs an owner, versioning, and an invalidation policy.

Measure

Cache hit rate, shared-memory usage, occupancy, bandwidth, kernel time, and tuning time.
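
To show what tiling changes and what it leaves alone, here is a minimal pure-Python sketch of a tiled matrix multiply next to a naive one. It only illustrates the loop structure: the tile size is a free parameter, and the result must be identical for every tile size. Function names are hypothetical, and a real kernel would map tiles to cache or shared memory and vector units rather than Python lists.

```python
def matmul_naive(a, b):
    """Reference result: plain triple loop."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def matmul_tiled(a, b, tile):
    """Same arithmetic, visited tile by tile so each working set stays small."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Work on one (tile x tile) block; min() handles ragged edges.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c

# Integer inputs keep the comparison exact; shapes are not multiples of the tile.
a = [[(3 * i + j) % 7 for j in range(6)] for i in range(5)]
b = [[(i + 2 * j) % 5 for j in range(4)] for i in range(6)]
results_match = all(matmul_tiled(a, b, t) == matmul_naive(a, b) for t in (1, 2, 4, 8))
```

Because correctness does not depend on the tile size, the choice is purely a performance decision, which is why it has to be tuned per shape and hardware target.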

IO-aware attention kernels

FlashAttention-style kernels reduce materialization and high-bandwidth-memory traffic for attention.

Implementation

Verify supported masks, sequence lengths, head dimensions, precision, and hardware generation.

Operational implications

Specialized kernels may fall back for unsupported features; the route must remain visible.

Measure

Attention time, HBM traffic, kernel selection, fallback rate, and memory.
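
The core idea behind avoiding materialization is a streaming (online) softmax: keys and values are processed in blocks while a running maximum, a running normalizer, and a running weighted sum are rescaled as needed. The sketch below shows that idea for a single query vector in pure Python. It is an illustration of the numerical technique, not a description of any specific kernel; score scaling, masks, and batching are omitted, and all names are hypothetical.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def attention_naive(q, keys, values):
    """Materializes every score before normalizing."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def attention_streaming(q, keys, values, block):
    """Processes keys/values block by block with a running max and sum."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results; exp(-inf) is 0.0 on the first block.
        rescale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[d] for w, v in zip(weights, v_blk))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [-1.0, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
reference = attention_naive(q, keys, values)
max_diff = max(abs(x - y)
               for blk in (1, 2, 5)
               for x, y in zip(attention_streaming(q, keys, values, blk), reference))
```

The streaming version only ever holds one block of scores, which is what removes the full score matrix from memory; `max_diff` confirms that the block size does not change the result beyond floating-point rounding.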

Quantization and mixed precision

Lower precision reduces stored bytes and may increase tensor throughput. Mixed precision preserves sensitive operations or accumulators.

Implementation

Name weight, activation, and KV precision separately; record scaling, group size, calibration, outlier treatment, and kernels.

Operational implications

Quality and numerical stability must be tested on representative tasks. A nominal bit width alone does not make a result reproducible; the format, scaling, calibration, and kernel path must be reported with it.

Measure

Artifact size, memory, throughput, latency, task quality, and numerical errors.
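
To make "record scaling and group size" concrete, here is a minimal sketch of symmetric per-group weight quantization using an absolute-maximum scale. The `QuantSpec` record and both functions are hypothetical names for this example. Real pipelines add calibration data, outlier handling, and packed storage, none of which is shown here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QuantSpec:
    """Hypothetical record of what must be reported alongside a bit width."""
    weight_bits: int
    group_size: int
    scheme: str = "symmetric-absmax"

def quantize_groups(weights, spec):
    """Return integer codes and one scale per group."""
    qmax = 2 ** (spec.weight_bits - 1) - 1
    codes, scales = [], []
    for start in range(0, len(weights), spec.group_size):
        group = weights[start:start + spec.group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-qmax, min(qmax, round(w / scale))) for w in group)
    return codes, scales

def dequantize_groups(codes, scales, spec):
    """Map codes back to floats using the scale of the group each code is in."""
    return [c * scales[i // spec.group_size] for i, c in enumerate(codes)]

spec = QuantSpec(weight_bits=8, group_size=4)
weights = [0.1, -0.5, 0.25, 1.0, -2.0, 0.0, 0.75, 1.5]
codes, scales = quantize_groups(weights, spec)
restored = dequantize_groups(codes, scales, spec)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The reconstruction error is bounded by half a quantization step of the group with the largest scale, which is why smaller groups usually lower the error at the cost of storing more scales. A bound on numerical error is not a substitute for task-quality evaluation.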

Pruning, sparsity, and codebooks

Sparsity removes weights or constrains them to a structured pattern; codebooks (palettization) store low-bit indices into a table of representative values.

Implementation

Use hardware-supported sparsity patterns and kernels that consume the compressed representation directly.

Operational implications

Compression can save storage without improving speed if the runtime expands the data before compute or lacks sparse kernels.

Measure

Effective density, decode overhead, kernel support, memory traffic, and quality.
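
As one example of a structured pattern, the sketch below applies N:M magnitude pruning, keeping the `n` largest-magnitude values in every group of `m` (2:4 is a pattern that some GPU tensor cores accelerate). The function names are hypothetical, and the code only produces the pattern; whether it runs faster depends entirely on a kernel that consumes the compressed form.

```python
def prune_n_of_m(weights, n=2, m=4):
    """Zero all but the n largest-magnitude values in each group of m."""
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        keep = sorted(range(len(group)), key=lambda i: abs(group[i]),
                      reverse=True)[:n]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned

def density(weights):
    """Fraction of nonzero weights."""
    return sum(1 for w in weights if w != 0) / len(weights)

weights = [0.1, -0.9, 0.4, 0.05, 1.0, -0.2, 0.3, -0.7]
pruned = prune_n_of_m(weights)
effective_density = density(pruned)
```

Magnitude is only a heuristic for importance, so the pruned model still needs the quality evaluation listed under Measure.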

Static memory planning and buffer reuse

Known lifetimes allow buffers to be reused and allocations removed from the critical path.

Implementation

Model dynamic-shape ranges and backend-specific workspaces. Verify observed peak against the plan.

Operational implications

Overly static plans reduce flexibility and can fail under concurrency or rare shapes.

Measure

Peak bytes, allocation count, fragmentation, reuse ratio, and OOM events.
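
Lifetime-based reuse can be sketched as a greedy planner: tensors are visited in order of first use, and each one takes the smallest already-allocated slot that is both free and large enough, or a new slot otherwise. The tuple layout, the function name, and the example sizes are assumptions for illustration. Production planners also handle alignment, dynamic-shape ranges, and backend workspaces.

```python
def plan_buffers(tensors):
    """tensors: list of (name, size_bytes, first_step, last_step)."""
    slots = []       # each slot: {"size": bytes, "free_at": first free step}
    assignment = {}  # tensor name -> slot index
    for name, size, first, last in sorted(tensors, key=lambda t: t[2]):
        candidates = [i for i, s in enumerate(slots)
                      if s["free_at"] <= first and s["size"] >= size]
        if candidates:
            idx = min(candidates, key=lambda i: slots[i]["size"])  # best fit
        else:
            slots.append({"size": size, "free_at": 0})
            idx = len(slots) - 1
        slots[idx]["free_at"] = last + 1
        assignment[name] = idx
    planned_peak = sum(s["size"] for s in slots)
    return assignment, planned_peak

# Made-up lifetimes: "a" is dead before "c" starts, "b" before "d" starts.
tensors = [("a", 100, 0, 1), ("b", 200, 1, 2), ("c", 100, 2, 3), ("d", 200, 3, 4)]
assignment, planned_peak = plan_buffers(tensors)
unplanned_bytes = sum(size for _, size, _, _ in tensors)
reuse_ratio = 1 - planned_peak / unplanned_bytes

def within_plan(observed_peak_bytes):
    """Compare a measured peak against the plan."""
    return observed_peak_bytes <= planned_peak
```

An observed peak above `planned_peak` indicates that the lifetime model missed something, for example a rare shape or a concurrent request, and is worth surfacing as an alert rather than discovering as an OOM.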

Batching, caching, and scheduling

Continuous batching, prefix reuse, paged KV allocation, and cache-aware routing improve utilization and avoid repeated work.

Implementation

Optimize queue budgets, fairness, cache scope, and eviction using production traffic.

Operational implications

Maximizing batch size or cache hit rate can hide poor TTFT, cross-tenant leakage, or cache thrashing.

Measure

Goodput, p95/p99 TTFT and TPOT, active tokens, cache hit rate, prefill avoided, and fairness.
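
The difference between throughput and Goodput is easiest to see with numbers. The sketch below uses one common definition, requests per second that meet every SLO, together with a nearest-rank percentile. The record fields `ttft_s` and `tpot_s`, the SLO thresholds, and the sample data are all hypothetical.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile for pct in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

def goodput(requests, ttft_slo_s, tpot_slo_s, window_s):
    """Requests per second that met both SLOs within the window."""
    ok = sum(1 for r in requests
             if r["ttft_s"] <= ttft_slo_s and r["tpot_s"] <= tpot_slo_s)
    return ok / window_s

# Ten made-up requests completed in a 5-second window.
ttfts = [0.1, 0.2, 0.15, 0.3, 0.25, 0.12, 0.18, 0.22, 1.5, 0.28]
tpots = [0.02] * 9 + [0.08]
requests = [{"ttft_s": t, "tpot_s": p} for t, p in zip(ttfts, tpots)]

throughput_rps = len(requests) / 5.0
goodput_rps = goodput(requests, ttft_slo_s=0.5, tpot_slo_s=0.05, window_s=5.0)
p50_ttft = percentile(ttfts, 50)
p95_ttft = percentile(ttfts, 95)
```

In this sample, throughput counts all ten requests while Goodput excludes the two that missed an SLO, and the median TTFT looks healthy even though the p95 does not.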

Speculative and parallel decoding

Draft predictions or multi-token heads reduce target decode steps when acceptance is high.

Implementation

Measure acceptance, draft/verify cost, memory, sampling constraints, and cache commit/rollback.

Operational implications

Speculation can regress saturated high-concurrency workloads or consume memory needed for more requests.

Measure

Acceptance, accepted tokens/pass, TPOT, memory, Goodput, and quality.
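
A rough planning model helps explain why acceptance dominates the outcome. The sketch below uses a commonly cited simplification in which each drafted token is accepted independently with the same probability, so a pass with `draft_len` drafted tokens yields a truncated geometric series of accepted tokens plus one token from the target. The function names and the cost ratio are assumptions; memory pressure and batching effects are not modeled, so measured TPOT and Goodput remain the deciding evidence.

```python
def expected_tokens_per_pass(acceptance, draft_len):
    """Expected tokens emitted per target pass under i.i.d. acceptance."""
    if acceptance >= 1.0:
        return draft_len + 1
    return (1 - acceptance ** (draft_len + 1)) / (1 - acceptance)

def estimated_speedup(acceptance, draft_len, draft_cost_ratio):
    """draft_cost_ratio: cost of one draft step relative to one target pass."""
    cost_per_pass = draft_len * draft_cost_ratio + 1
    return expected_tokens_per_pass(acceptance, draft_len) / cost_per_pass

# Made-up operating points: same draft length and cost, different acceptance.
high = estimated_speedup(acceptance=0.8, draft_len=4, draft_cost_ratio=0.1)
low = estimated_speedup(acceptance=0.2, draft_len=4, draft_cost_ratio=0.1)
```

With these inputs the high-acceptance case is estimated to be faster than plain decoding and the low-acceptance case slower, which matches the warning that speculation can regress a workload.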

Optimization workflow

Optimization is a controlled experiment over a production-shaped baseline.

Implementation

Profile queue, host, transfer, kernel, cache, and quality; change bounded mechanisms; retain raw evidence and rollback.

Operational implications

Promote only when the selected workload metric improves without violating correctness, security, portability, or operational constraints.

Measure

Baseline delta, confidence/run spread, quality gates, regression count, and rollback readiness.
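
A promotion gate along these lines can be sketched in a few lines. The version below assumes a higher-is-better metric such as Goodput and requires the relative gain to exceed both a minimum threshold and twice the run-to-run spread. The function name, the 2% threshold, and the factor of two are arbitrary illustrative choices, not a statistical test.

```python
from statistics import mean, stdev

def should_promote(baseline_runs, candidate_runs, quality_ok, min_gain=0.02):
    """Gate on quality first, then on gain relative to run-to-run noise."""
    base, cand = mean(baseline_runs), mean(candidate_runs)
    delta = (cand - base) / base
    spread = max(stdev(baseline_runs), stdev(candidate_runs)) / base
    promote = bool(quality_ok and delta >= min_gain and delta > 2 * spread)
    return {"delta": delta, "spread": spread, "promote": promote}

baseline = [100.0, 102.0, 98.0]  # made-up Goodput from three baseline runs
stable = should_promote(baseline, [110.0, 112.0, 108.0], quality_ok=True)
noisy = should_promote(baseline, [90.0, 130.0, 110.0], quality_ok=True)
failed_quality = should_promote(baseline, [110.0, 112.0, 108.0], quality_ok=False)
```

The stable and noisy candidates have the same mean gain, but only the stable one clears the spread check, and a failed quality gate blocks promotion regardless of the gain.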

Reference tables

Optimization technique matrix
Technique	Optimizes	Primary trade-off	Measure
Constant folding / DCE	Unnecessary graph work	Compile complexity	Nodes and kernel count
Fusion	Launches and memory traffic	Register pressure/code size	Bytes moved, spills, latency
Tiling/vectorization	Locality and utilization	Shape/hardware sensitivity	Occupancy, bandwidth, kernel time
Quantization	Memory, bandwidth, compute	Quality and kernel support	Quality, size, Goodput
Static memory planning	Allocation and peak memory	Less dynamic flexibility	Peak bytes and allocations
Continuous batching	Serving utilization	Queue/fairness complexity	Goodput and tail latency
Prefix/KV reuse	Repeated prefill	Retention/privacy/eviction	Hit rate and prefill avoided
Speculative decoding	Target decode steps	Draft memory and verification	Acceptance and TPOT

Decision checklist

What measured bottleneck does the optimization target?
Which runtime layer owns it and what new assumptions are introduced?
What correctness and quality envelope must remain unchanged?
How will tail latency, memory, power, and cost be measured?
What fallback or disable switch exists?
Does the result hold across production shapes and concurrency?

Common mistakes

Applying quantization without task-quality evaluation.
Reporting an isolated kernel speedup as an application speedup.
Fusing until register pressure erases memory savings.
Maximizing batch size while ignoring TTFT and fairness.
Enabling prefix caching without tenant and retention policy.
Comparing different models, precision, or sequence distributions as though only the runtime changed.

Sources and further reading

XLA architecture · OpenXLA · Official documentation · accessed 2026-06-21 UTC
TensorRT quantized types · NVIDIA · Official documentation · accessed 2026-06-21 UTC
vLLM documentation · vLLM · Official documentation · accessed 2026-06-21 UTC
TVM tutorials · Apache TVM · Official documentation · accessed 2026-06-21 UTC

Last reviewed: 2026-06-21 UTC
