Speculative Decoding - aRuntime.com

Key takeaways

Speculation accelerates decode only when accepted tokens offset draft and verification overhead.
Correct implementations preserve the target distribution; speed claims must include quality-equivalence validation.
External draft models, learned draft modules, multiple heads, and native multi-token prediction have different training and memory requirements.
The technique often helps memory-bound, moderate-concurrency decode more than an already saturated server.
Structured constraints, adapters, quantization, and distributed placement can change acceptance and integration cost.

Runtime boundary

A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.

Receives

Target state, draft mechanism, tokenizer alignment, sampling settings, speculative length, request constraints, and cache state.

Owns

Proposal generation, target verification, acceptance sampling, cache commit/rollback, and per-request speculation policy.

Emits

Accepted token runs, first rejected position, corrected token, acceptance metrics, and committed target cache state.

Does not own

Permission to change the target distribution or claim universal speedup.

Failure modes

Low acceptance, draft overhead, cache inconsistency, tokenizer mismatch, excess memory, unsupported constraints, and worse latency under saturation.

Evidence and metrics

Acceptance rate, accepted tokens per target pass, draft time, verification time, rollback count, time per output token (TPOT), Goodput, extra memory, and quality equivalence.

Draft, verify, accept, or correct

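A draft mechanism proposes several candidate tokens, and the target scores all of them in one parallel pass. Candidates are accepted in order until the first rejection; at that position a correction is sampled so that the output still follows the target distribution. If every candidate is accepted, the same target pass yields one additional token.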
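
The accept-or-correct step can be sketched as follows, using the acceptance rule from the speculative sampling literature: accept a drafted token with probability min(1, p/q), and on the first rejection sample the correction from the normalized residual max(p - q, 0). The function name `verify_draft`, its argument layout, and the use of plain Python lists for distributions are assumptions made for illustration; a real runtime would operate on batched tensors.

```python
import random


def _sample(weights, rng):
    """Sample an index in proportion to non-negative weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Run one speculation round and return (accepted_tokens, next_token).

    draft_tokens: k proposed token ids.
    draft_probs:  k distributions q_i the draft sampled each token from.
    target_probs: k + 1 distributions p_i produced by the single target pass.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p = target_probs[i][tok]
        q = draft_probs[i][tok]  # q > 0 because tok was sampled from q
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)
            continue
        # First rejection: sample the correction from max(p - q, 0).
        residual = [max(pt - qt, 0.0)
                    for pt, qt in zip(target_probs[i], draft_probs[i])]
        return accepted, _sample(residual, rng)
    # Every candidate was accepted: take one extra token from the target.
    return accepted, _sample(target_probs[len(draft_tokens)], rng)
```

When the draft and target distributions are identical, every candidate is accepted; when the target assigns zero probability to a drafted token, that token is always rejected and replaced.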

Implementation

Coordinate draft and target KV state, tokenization, sampling, and rollback. Validate mathematical distribution preservation or document an approximation.
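
A minimal sketch of cache commit and rollback, assuming a per-request cache that can be modeled as a list of per-position entries. The class `SpeculativeKVCache` and its methods are hypothetical; production runtimes usually manage paged or block-based caches, so this only illustrates the bookkeeping invariant.

```python
class SpeculativeKVCache:
    """Toy cache: per-position entries plus a committed length."""

    def __init__(self):
        self.entries = []
        self.committed = 0

    def append_speculative(self, new_entries):
        """Add entries for drafted positions; they are not yet committed."""
        self.entries.extend(new_entries)

    def commit(self, n_accepted):
        """Keep the first n_accepted pending entries and drop the rest."""
        pending = len(self.entries) - self.committed
        if not 0 <= n_accepted <= pending:
            raise ValueError("n_accepted is outside the pending range")
        self.committed += n_accepted
        # Rollback: discard entries written for rejected positions.
        del self.entries[self.committed:]
```

The invariant worth testing is that after every round `len(entries) == committed`, so a rejected position can never be read by a later target pass.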

Operational implications

Cache bookkeeping and draft scheduling can erase the theoretical gain.

Measure

Draft tokens, accepted tokens/pass, rejection position, draft/verify time, and rollback.
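
These measurements could be collected per request with a small accumulator such as the one below. The class and field names are illustrative assumptions, not part of any particular runtime.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SpeculationStats:
    target_passes: int = 0
    drafted: int = 0
    accepted: int = 0
    rollbacks: int = 0
    draft_seconds: float = 0.0
    verify_seconds: float = 0.0
    # Histogram of the index of the first rejected draft token.
    rejection_positions: Counter = field(default_factory=Counter)

    def record(self, drafted, accepted, draft_seconds, verify_seconds):
        """Record one speculation round (one target pass)."""
        self.target_passes += 1
        self.drafted += drafted
        self.accepted += accepted
        self.draft_seconds += draft_seconds
        self.verify_seconds += verify_seconds
        if accepted < drafted:
            self.rollbacks += 1
            self.rejection_positions[accepted] += 1

    @property
    def acceptance_rate(self):
        return self.accepted / self.drafted if self.drafted else 0.0

    @property
    def accepted_per_pass(self):
        return self.accepted / self.target_passes if self.target_passes else 0.0
```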

External draft models

A smaller compatible model proposes tokens for a larger target.

Implementation

Select a tokenizer-compatible, much cheaper draft and decide whether it shares hardware or uses separate resources.
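
A basic compatibility check might compare the two vocabularies before a draft and target are paired. The sketch assumes each vocabulary is available as a token-to-id mapping; `vocab_mismatches` is a hypothetical helper. Passing this check is necessary for sharing token ids but says nothing about how well the draft's predictions align with the target.

```python
def vocab_mismatches(target_vocab, draft_vocab):
    """Return tokens whose ids differ or that exist in only one vocabulary."""
    tokens = set(target_vocab) | set(draft_vocab)
    return sorted(t for t in tokens
                  if target_vocab.get(t) != draft_vocab.get(t))
```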

Operational implications

It adds weight memory, cache, model lifecycle, and potentially distributed communication.
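
A rough estimate of the extra memory can be computed from the draft model's shape. The sketch assumes a standard transformer KV cache (one key and one value vector per layer, KV head, and token) and 2-byte elements by default; the example dimensions are hypothetical, and real runtimes add allocator and paging overhead on top.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """Keys and values for every layer, KV head, and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


def weight_bytes(n_params, bytes_per_param=2):
    """Memory for model weights at the given precision."""
    return n_params * bytes_per_param


# Hypothetical draft: 16 layers, 8 KV heads of dimension 64, 4096 cached tokens.
draft_cache_mib = kv_cache_bytes(16, 8, 64, 4096) / 2**20
```

Multiplying the per-request cache figure by the expected number of concurrent requests shows how much concurrency the draft could displace.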

Measure

Draft memory, acceptance by task, draft latency, target passes saved, and Goodput.

EAGLE-style drafts

A learned draft module uses target hidden-state information to improve proposal alignment.

Implementation

Package the draft module and target as a compatible versioned pair; validate hidden-state and cache interfaces.

Operational implications

Model-specific training and runtime integration increase operational coupling.

Measure

Acceptance, module latency, hidden-state transfer, memory, and quality.

Medusa and multi-head trees

Multiple prediction heads propose a tree of future tokens from the target representation.

Implementation

Build tree verification and commit logic; cap branching according to hardware and workload.
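
The commit logic for a candidate tree can be sketched for the greedy case. The tree is modeled as nested dictionaries mapping a token to its subtree, and `target_next` is a stand-in callable returning the target's greedy token for a context; both are assumptions for illustration. A real implementation would score every tree node in one target pass with a tree attention mask rather than calling the target per node.

```python
def accept_tree_path(tree, target_next, prefix):
    """Walk the candidate tree along the target's greedy choices.

    Returns (accepted_tokens, next_token), where next_token is the target's
    token at the first position not covered by the tree.
    """
    accepted = []
    node = tree
    while True:
        tok = target_next(prefix + accepted)
        if tok not in node:
            # No candidate matches: emit the target's own token and stop.
            return accepted, tok
        accepted.append(tok)
        node = node[tok]


# Candidate tree with two root branches; the branch under 1 splits again.
tree = {1: {2: {}, 3: {4: {}}}, 5: {}}
```

Only one root-to-leaf path can be committed, so every other verified candidate is work that produced no tokens; this is why branching is capped.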

Operational implications

Tree breadth increases verification work and implementation complexity.

Measure

Accepted path length, candidates verified, head overhead, TPOT, and memory.

Native multi-token prediction

Some models are trained with auxiliary heads or objectives to predict multiple future tokens.

Implementation

Enable it only with the matching model and runtime implementation, and document which models are supported.

Operational implications

It can simplify deployment but is not available for arbitrary checkpoints.

Measure

Head cost, acceptance/usable tokens, target-pass reduction, and Goodput.

When speculation helps

The best fit is memory-bound decode where each target pass is expensive, the draft is much cheaper, and acceptance stays consistently high.
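
The trade-off can be made concrete with the expected-speedup model from the Leviathan et al. paper listed under Sources. It assumes each drafted token is accepted independently with probability `alpha`, `gamma` tokens are drafted per round, and one draft step costs `cost_ratio` times one target pass. The function names are illustrative, and the independence assumption rarely holds exactly, so treat the result as a rough guide rather than a prediction.

```python
def expected_tokens_per_pass(alpha, gamma):
    """Expected tokens emitted per target pass, including the final token."""
    if alpha == 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha, gamma, cost_ratio):
    """Expected wall-time improvement over ordinary decode."""
    return expected_tokens_per_pass(alpha, gamma) / (gamma * cost_ratio + 1)
```

With `alpha=0.8`, `gamma=4`, and `cost_ratio=0.05` the model gives a speedup above 2.5; with `alpha=0.3` and `cost_ratio=0.3` it falls below 1, meaning speculation would slow decode down.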

Implementation

Segment tests by task, language, sampling, prompt type, concurrency, and cache state.

Operational implications

At high concurrency the target may already use parallel compute efficiently; draft work can reduce total capacity.

Measure

Speedup/Goodput by concurrency, acceptance distribution, memory, and tail latency.

Constraints and adapters

Grammar constraints, sampling, adapters, quantization, and distributed placement can alter proposal alignment.

Implementation

Apply compatible masks and sampling logic, measure adapter-specific acceptance, and avoid expensive device transfers.
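
One way to keep masks compatible is to apply the same allowed-token set to the draft and target distributions before the acceptance test, as sketched below. `apply_mask` is a hypothetical helper operating on plain lists; the constrained target distribution is then the one being preserved. Masking the draft is not required for correctness, since a disallowed token has zero target probability and is always rejected, but it avoids wasting draft positions.

```python
def apply_mask(probs, allowed):
    """Zero out disallowed tokens and renormalize the distribution."""
    masked = [p if i in allowed else 0.0 for i, p in enumerate(probs)]
    total = sum(masked)
    if total == 0.0:
        raise ValueError("mask removes all probability mass")
    return [p / total for p in masked]
```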

Operational implications

A base-model draft can diverge from a fine-tuned target.

Measure

Acceptance by adapter/constraint, transfer, invalid output, and fallback.

Release and fallback

Speculation should be a bounded, observable optimization, with ordinary decode always available as a fallback.

Implementation

Verify cache correctness under rejection, cancellation cleanup, quality equivalence, and automatic disable thresholds.
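
For greedy decoding, quality equivalence can be checked exactly: the speculative path must emit the same tokens as ordinary decode. The sketch below uses toy callables in place of real models, and all names are assumptions. With sampling enabled, the comparison has to be statistical instead.

```python
def greedy_decode(next_token, prompt, n_new):
    """Ordinary decode: one target call per emitted token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out))
    return out[len(prompt):]


def speculative_greedy_decode(target_next, draft_next, prompt, n_new, k):
    """Greedy speculation: accept draft tokens while they match the target."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))
        # A real runtime scores all k positions in one target pass.
        for tok in proposal:
            expected = target_next(out)
            out.append(expected)  # matches tok when accepted, else corrects it
            if expected != tok:
                break
        else:
            out.append(target_next(out))  # all accepted: one extra token
    return out[len(prompt):len(prompt) + n_new]


# Toy models: the draft agrees with the target except at every third length.
toy_target = lambda ctx: (3 * ctx[-1] + len(ctx)) % 7
toy_draft = lambda ctx: toy_target(ctx) if len(ctx) % 3 else 0
```

A release gate could run this comparison over a fixed prompt set and fail on the first divergence.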

Operational implications

Disable when acceptance or Goodput falls below policy or memory pressure threatens concurrency.
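
An automatic disable policy might look like the rolling-window gate below. The class name, the default thresholds, and the reason strings are hypothetical placeholders; actual values belong in deployment policy, and a Goodput check could be added in the same way.

```python
from collections import deque


class SpeculationGate:
    """Decide whether speculation stays enabled, with a reason string."""

    def __init__(self, min_acceptance=0.5, window=100, min_rounds=20):
        self.min_acceptance = min_acceptance
        self.min_rounds = min_rounds
        self.rounds = deque(maxlen=window)  # recent (drafted, accepted) pairs

    def record(self, drafted, accepted):
        self.rounds.append((drafted, accepted))

    def decision(self, memory_pressure=False):
        if memory_pressure:
            return False, "memory_pressure"
        if len(self.rounds) < self.min_rounds:
            return True, "warming_up"
        drafted = sum(d for d, _ in self.rounds)
        accepted = sum(a for _, a in self.rounds)
        if drafted == 0 or accepted / drafted < self.min_acceptance:
            return False, "low_acceptance"
        return True, "ok"
```

Returning the reason alongside the decision makes the disable reason directly reportable as a metric.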

Measure

Fallback rate, disable reason, quality gate, incident rate, and rollback time.

Reference tables

Speculative decoding patterns
Pattern	Proposal source	Strength	Cost / limitation
External draft model	Smaller compatible model	General and understandable	Extra model memory and scheduling
EAGLE-style draft	Learned module using target representations	Potentially high acceptance	Training and model-specific integration
Medusa-style heads	Multiple target-attached heads	No full external model	Tree verification and model modification
Native multi-token prediction	Heads trained with base model	Integrated path	Only compatible models
N-gram/suffix methods	Observed token patterns	No draft model	Workload-dependent acceptance
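
The n-gram row can be illustrated with a simple lookup over the request's own context: find the most recent earlier occurrence of the last `n` tokens and propose what followed it. This is a sketch of the general idea under that assumption, not any particular library's implementation, and `ngram_propose` is a hypothetical name.

```python
def ngram_propose(context, n, k):
    """Propose up to k tokens that followed the latest earlier match
    of the final n-gram in context; return [] when there is no match."""
    if len(context) <= n:
        return []
    tail = context[-n:]
    # Scan backwards, skipping the tail's own position.
    for start in range(len(context) - n - 1, -1, -1):
        if context[start:start + n] == tail:
            return context[start + n:start + n + k]
    return []
```

Proposals of this kind cost almost nothing to produce, which is why acceptance depends so heavily on how repetitive the workload is.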

Decision checklist

Is the workload decode-bound and does the device have verification headroom?
What acceptance distribution is expected by task and sampling policy?
How much memory do draft state and caches consume?
How are cache commit and rollback implemented?
Are structured outputs, adapters, and distributed placement supported?
What ordinary-decode fallback exists?
How is target-quality equivalence demonstrated?

Common mistakes

Quoting best-case speedup without acceptance and workload details.
Assuming tokenizer compatibility guarantees good draft alignment.
Ignoring concurrency lost to draft memory.
Testing only greedy decoding when production uses sampling.
Failing to verify cache state after rejected proposals.
Enabling speculation under saturation where verification adds contention.

Sources and further reading

Speculative decoding
vLLM · Official documentation · accessed 2026-06-21 UTC

Fast inference via speculative decoding
Leviathan et al. · Research paper · accessed 2026-06-21 UTC

EAGLE repository
SafeAILab · Official research repository · accessed 2026-06-21 UTC

Medusa repository
FasterDecoding · Official research repository · accessed 2026-06-21 UTC

TensorRT-LLM speculative decoding
NVIDIA · Official documentation · accessed 2026-06-21 UTC

Last reviewed: 2026-06-21 UTC
