Key takeaways
- Speculation accelerates decode only when accepted tokens offset draft and verification overhead.
- Correct implementations preserve the target distribution; speed claims must include quality-equivalence validation.
- External draft models, learned draft modules, multiple heads, and native multi-token prediction have different training and memory requirements.
- The technique often helps memory-bound, moderate-concurrency decode more than an already saturated server.
- Structured constraints, adapters, quantization, and distributed placement can change acceptance and integration cost.
Runtime boundary
A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.
Receives
Target state, draft mechanism, tokenizer alignment, sampling settings, speculative length, request constraints, and cache state.
Owns
Proposal generation, target verification, acceptance sampling, cache commit/rollback, and per-request speculation policy.
Emits
Accepted token runs, first rejected position, corrected token, acceptance metrics, and committed target cache state.
Does not own
Permission to change the target distribution or claim universal speedup.
Failure modes
Low acceptance, draft overhead, cache inconsistency, tokenizer mismatch, excess memory, unsupported constraints, and worse latency under saturation.
Evidence and metrics
Acceptance, accepted tokens/target pass, draft time, verification time, rollback, TPOT, Goodput, extra memory, and quality equivalence.
Draft, verify, accept, or correct
A draft mechanism proposes multiple candidates; the target evaluates them in one parallel pass. Accepted tokens are committed until the first rejection, then a target-consistent correction is sampled.
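The accept-or-correct step can be sketched as follows. This is a minimal illustration of the standard speculative sampling rule, assuming full probability vectors are available for both models at every drafted position. The names `sample` and `verify_draft` are hypothetical and do not come from any particular runtime.

```python
import random
from typing import List, Sequence, Tuple


def sample(dist: Sequence[float], rng: random.Random) -> int:
    """Draw one token id from a (possibly unnormalized) weight vector."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def verify_draft(
    draft_tokens: Sequence[int],
    q: Sequence[Sequence[float]],  # q[i]: draft distribution at position i
    p: Sequence[Sequence[float]],  # p[i]: target distribution; len(p) == len(q) + 1
    rng: random.Random,
) -> Tuple[List[int], int]:
    """Return (tokens to commit, number of draft tokens accepted)."""
    committed: List[int] = []
    for i, tok in enumerate(draft_tokens):
        # Accept the drafted token with probability min(1, p/q).
        accept_prob = min(1.0, p[i][tok] / q[i][tok])
        if rng.random() < accept_prob:
            committed.append(tok)
            continue
        # First rejection: resample from the residual max(0, p - q).
        # rng.choices normalizes the weights, so no explicit division is needed.
        residual = [max(0.0, pt - qt) for pt, qt in zip(p[i], q[i])]
        committed.append(sample(residual, rng))
        return committed, i
    # Every draft token was accepted: the same target pass also yields
    # one extra token from the position after the last draft token.
    committed.append(sample(p[len(draft_tokens)], rng))
    return committed, len(draft_tokens)
```

When the draft and target distributions agree at a position, the acceptance probability is 1, which is why a well-aligned draft produces long accepted runs. The residual resampling is what keeps the committed tokens distributed according to the target rather than the draft.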
Implementation
Coordinate draft and target KV state, tokenization, sampling, and rollback. Validate mathematical distribution preservation or document an approximation.
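One way to picture the cache side of this coordination is the toy structure below. It is a sketch only: `ToyKVCache` is a hypothetical stand-in that stores token ids where a real cache would hold key/value tensors or block references, and it assumes a single request with no sharing between sequences.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class ToyKVCache:
    """One entry per cached token position (hypothetical illustration)."""
    entries: List[int] = field(default_factory=list)
    committed: int = 0  # number of leading positions known to be valid

    def append_speculative(self, tokens: Sequence[int]) -> None:
        """Add entries written while the target scores the drafted tokens."""
        self.entries.extend(tokens)

    def commit(self, n_accepted: int) -> None:
        """Keep the first n_accepted speculative entries and roll back the rest."""
        speculative = len(self.entries) - self.committed
        if not 0 <= n_accepted <= speculative:
            raise ValueError("n_accepted is outside the speculative range")
        self.committed += n_accepted
        # Rollback: drop every entry written for a rejected position.
        del self.entries[self.committed:]


cache = ToyKVCache(entries=[1, 2, 3], committed=3)
cache.append_speculative([4, 5, 6])  # three drafted positions
cache.commit(1)                      # only the first draft token was accepted
```

In this sketch the corrected token is not yet in the cache after `commit`; its entry would be written on the next target pass. Whether a real runtime truncates, overwrites, or frees blocks on rollback is implementation-specific.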
Operational implications
Cache bookkeeping and draft scheduling can erase the theoretical gain.
Measure
Draft tokens, accepted tokens/pass, rejection position, draft/verify time, and rollback.
External draft models
A smaller compatible model proposes tokens for a larger target.
Implementation
Select a tokenizer-compatible, much cheaper draft and decide whether it shares hardware or uses separate resources.
Operational implications
It adds weight memory, cache, model lifecycle, and potentially distributed communication.
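A rough sizing of the extra cache can use the usual per-token KV footprint of a decoder-only transformer: keys plus values, per layer, per KV head. The dimensions below are made-up illustration values, not measurements of any real draft model, and the helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int,
    tokens: int,
) -> int:
    """Bytes of KV cache for `tokens` cached positions (keys and values)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * tokens


# Hypothetical draft model: 16 layers, 8 KV heads of dimension 64, 16-bit cache.
# 32 concurrent sequences of 4096 tokens each.
draft_cache = kv_cache_bytes(16, 8, 64, 2, tokens=32 * 4096)
print(f"{draft_cache / 2**30:.1f} GiB")  # prints "4.0 GiB"
```

Draft weights come on top of this figure, and memory given to the draft is memory not available for target cache, which is how speculation can reduce achievable concurrency.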
Measure
Draft memory, acceptance by task, draft latency, target passes saved, and Goodput.
EAGLE-style drafts
A learned draft module uses target hidden-state information to improve proposal alignment.
Implementation
Package the draft module and target as a compatible versioned pair; validate hidden-state and cache interfaces.
Operational implications
Model-specific training and runtime integration increase operational coupling.
Measure
Acceptance, module latency, hidden-state transfer, memory, and quality.
Medusa and multi-head trees
Multiple prediction heads propose a tree of future tokens from the target representation.
Implementation
Build tree verification and commit logic; cap branching according to hardware and workload.
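The commit logic for a candidate tree can be illustrated with greedy verification. This is a simplified sketch under stated assumptions: the tree is a nested dict of token ids, `target_next` is a hypothetical callable returning the target's greedy token for a prefix, and the target is queried step by step for readability, whereas a real runtime would score all tree nodes in one pass with a tree attention mask.

```python
from typing import Callable, Dict, List, Sequence

Tree = Dict[int, "Tree"]  # token id -> subtree of candidate continuations


def count_candidates(tree: Tree) -> int:
    """Number of tree nodes the target must verify (a proxy for verify cost)."""
    return sum(1 + count_candidates(child) for child in tree.values())


def longest_accepted_path(
    tree: Tree,
    prefix: Sequence[int],
    target_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Walk the tree along the target's greedy choices.

    Returns the tokens to commit: every matched candidate plus one final
    target token (a correction, or a bonus token after a leaf).
    """
    path: List[int] = []
    node = tree
    while True:
        tok = target_next(list(prefix) + path)
        path.append(tok)  # the target's own token is always valid to commit
        if tok not in node:
            return path
        node = node[tok]
```

Capping branching amounts to bounding `count_candidates(tree)`: every extra branch is verified whether or not it lies on the accepted path.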
Operational implications
Tree breadth increases verification work and implementation complexity.
Measure
Accepted path length, candidates verified, head overhead, TPOT, and memory.
Native multi-token prediction
Some models are trained with auxiliary heads or objectives to predict multiple future tokens.
Implementation
Use only with the matching model and runtime implementation, and document which models are supported.
Operational implications
It can simplify deployment but is not available for arbitrary checkpoints.
Measure
Head cost, acceptance/usable tokens, target-pass reduction, and Goodput.
When speculation helps
The best fit is memory-bound decode where each target pass is expensive, the draft is much cheaper, and acceptance is consistently strong.
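The interaction of these three factors can be made concrete with the simplified estimate commonly used in the speculative decoding literature. It assumes each draft token is accepted independently with the same probability `alpha`, that drafting one token costs `cost_ratio` times a target pass, and that verifying the whole draft costs the same as one ordinary target step. That last assumption holds only when the device has spare compute. The function names are hypothetical.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens committed per target pass with draft length gamma."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated wall-time speedup over ordinary decode (values < 1 are slowdowns)."""
    return expected_tokens_per_pass(alpha, gamma) / (gamma * cost_ratio + 1.0)


# Strong acceptance with a cheap draft: a clear win under these assumptions.
good = expected_speedup(alpha=0.8, gamma=4, cost_ratio=0.05)
# Weak acceptance with a costly draft: speculation is slower than plain decode.
bad = expected_speedup(alpha=0.3, gamma=4, cost_ratio=0.3)
```

The estimate is optimistic for a saturated server, where the verification pass itself costs more than a single-token step.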
Implementation
Segment tests by task, language, sampling, prompt type, concurrency, and cache state.
Operational implications
At high concurrency the target may already use parallel compute efficiently; draft work can reduce total capacity.
Measure
Speedup/Goodput by concurrency, acceptance distribution, memory, and tail latency.
Constraints and adapters
Grammar constraints, sampling, adapters, quantization, and distributed placement can alter proposal alignment.
Implementation
Apply compatible masks and sampling logic, measure adapter-specific acceptance, and avoid expensive device transfers.
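A minimal sketch of compatible masking follows, assuming the constraint is expressed as a set of allowed token ids per position. `apply_mask` is a hypothetical helper. The idea is to apply the same mask to the draft and target distributions before the acceptance test, so that verification compares constrained distributions on both sides.

```python
from typing import List, Sequence, Set


def apply_mask(probs: Sequence[float], allowed: Set[int]) -> List[float]:
    """Zero out disallowed tokens and renormalize the remainder."""
    masked = [p if i in allowed else 0.0 for i, p in enumerate(probs)]
    total = sum(masked)
    if total == 0.0:
        raise ValueError("constraint leaves no token with nonzero probability")
    return [m / total for m in masked]


allowed = {0, 2}  # hypothetical grammar state allowing only tokens 0 and 2
draft_q = apply_mask([0.5, 0.3, 0.2], allowed)
target_p = apply_mask([0.2, 0.6, 0.2], allowed)
```

Masking the target is what keeps the output valid. Masking the draft as well does not change correctness of the acceptance rule, but an unmasked draft proposes tokens the constrained target assigns zero probability, and those are always rejected.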
Operational implications
A base-model draft can diverge from a fine-tuned target.
Measure
Acceptance by adapter/constraint, transfer, invalid output, and fallback.
Release and fallback
Speculation should be a bounded, observable optimization, with ordinary decode always available as the fallback.
Implementation
Verify cache correctness under rejection, cancellation cleanup, quality equivalence, and automatic disable thresholds.
Operational implications
Disable when acceptance or Goodput falls below policy or memory pressure threatens concurrency.
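A disable policy of this kind might look like the sketch below. The class name `SpeculationGate`, its thresholds, and the reason strings are all hypothetical; real policy values should come from measured acceptance and Goodput on the production workload.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class SpeculationGate:
    """Rolling-window check that decides whether to keep speculating."""

    def __init__(self, min_acceptance: float = 0.5, window: int = 100,
                 min_samples: int = 20) -> None:
        self.min_acceptance = min_acceptance
        self.min_samples = min_samples
        # One (drafted, accepted) pair per target pass; old passes fall out.
        self.history: Deque[Tuple[int, int]] = deque(maxlen=window)
        self.disable_reason: Optional[str] = None

    def record(self, drafted: int, accepted: int) -> None:
        self.history.append((drafted, accepted))

    def acceptance(self) -> Optional[float]:
        drafted = sum(d for d, _ in self.history)
        accepted = sum(a for _, a in self.history)
        return accepted / drafted if drafted else None

    def should_speculate(self, memory_pressure: bool = False) -> bool:
        if memory_pressure:
            self.disable_reason = "memory_pressure"
            return False
        rate = self.acceptance()
        enough = len(self.history) >= self.min_samples
        if enough and rate is not None and rate < self.min_acceptance:
            self.disable_reason = "low_acceptance"
            return False
        self.disable_reason = None
        return True
```

Because no new samples arrive while speculation is off, a production gate would also need some way to re-enable it, such as periodic probe requests. This sketch leaves that out.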
Measure
Fallback rate, disable reason, quality gate, incident rate, and rollback time.
Reference tables
| Pattern | Proposal source | Strength | Cost / limitation |
|---|---|---|---|
| External draft model | Smaller compatible model | General and understandable | Extra model memory and scheduling |
| EAGLE-style draft | Learned module using target representations | Potentially high acceptance | Training and model-specific integration |
| Medusa-style heads | Multiple target-attached heads | No full external model | Tree verification and model modification |
| Native multi-token prediction | Heads trained with base model | Integrated path | Only compatible models |
| N-gram/suffix methods | Observed token patterns | No draft model | Workload-dependent acceptance |
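
The last row of the table needs no model at all, which a short sketch makes clear. The function below is a hypothetical illustration of n-gram lookup: it finds the most recent earlier occurrence of the current suffix in the token history and proposes whatever followed it. The proposed tokens still go through normal target verification.

```python
from typing import List, Sequence


def ngram_propose(tokens: Sequence[int], ngram: int = 2,
                  max_draft: int = 4) -> List[int]:
    """Propose up to max_draft tokens by matching the last `ngram` tokens
    against earlier context. Returns [] when there is no earlier match."""
    history = list(tokens)
    if len(history) < ngram:
        return []
    key = history[-ngram:]
    # Scan backwards, starting just before the suffix itself.
    for start in range(len(history) - ngram - 1, -1, -1):
        if history[start:start + ngram] == key:
            continuation = history[start + ngram:start + ngram + max_draft]
            if continuation:
                return continuation
    return []


# The suffix [1, 2] appeared earlier, followed by 3, 4.
draft = ngram_propose([1, 2, 3, 4, 1, 2], ngram=2, max_draft=2)  # [3, 4]
```

Acceptance depends entirely on how repetitive the workload is: code editing and retrieval-heavy prompts tend to repeat spans, while open-ended generation often does not.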
Decision checklist
- Is the workload decode-bound and does the device have verification headroom?
- What acceptance distribution is expected by task and sampling policy?
- How much memory do draft state and caches consume?
- How are cache commit and rollback implemented?
- Are structured outputs, adapters, and distributed placement supported?
- What ordinary-decode fallback exists?
- How is target-quality equivalence demonstrated?
Common mistakes
- Quoting best-case speedup without acceptance and workload details.
- Assuming tokenizer compatibility guarantees good draft alignment.
- Ignoring concurrency lost to draft memory.
- Testing only greedy decoding when production uses sampling.
- Failing to verify cache state after rejected proposals.
- Enabling speculation under saturation where verification adds contention.
Sources and further reading
- Speculative decoding
- Fast inference via speculative decoding
- EAGLE repository
- Medusa repository
- TensorRT-LLM speculative decoding
Last reviewed: 2026-06-21 UTC
