Speculative Decoding

Learn speculative decoding mechanics, draft and target models, verification, acceptance rate, EAGLE, Medusa, multi-token prediction, integration constraints, metrics, and failure modes.

Audience: Technical readers · Reading time: 5 minutes · Status: Foundational

Key takeaways

  • Speculation accelerates decode only when accepted tokens offset draft and verification overhead.
  • Correct implementations preserve the target distribution; speed claims must include quality-equivalence validation.
  • External draft models, learned draft modules, multiple heads, and native multi-token prediction have different training and memory requirements.
  • The technique tends to help memory-bound decode at low to moderate concurrency; on an already saturated server, draft and verification work competes with useful compute.
  • Structured constraints, adapters, quantization, and distributed placement can change acceptance and integration cost.

Runtime boundary

A useful architecture identifies what this layer receives, owns, emits, measures, and refuses to own. That boundary prevents overlapping products from being treated as interchangeable.

Receives

Target state, draft mechanism, tokenizer alignment, sampling settings, speculative length, request constraints, and cache state.

Owns

Proposal generation, target verification, acceptance sampling, cache commit/rollback, and per-request speculation policy.

Emits

Accepted token runs, first rejected position, corrected token, acceptance metrics, and committed target cache state.

Does not own

Permission to change the target distribution or claim universal speedup.

Failure modes

Low acceptance, draft overhead, cache inconsistency, tokenizer mismatch, excess memory, unsupported constraints, and worse saturated latency.

Evidence and metrics

Acceptance, accepted tokens/target pass, draft time, verification time, rollback, TPOT, Goodput, extra memory, and quality equivalence.

Draft, verify, accept, or correct

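A draft mechanism proposes several candidate tokens; the target scores all of them in one parallel pass. Proposals are accepted in order until the first rejection. At that position the target samples a correction from the residual distribution (the normalized positive part of target probability minus draft probability), which keeps the output distributed exactly as ordinary target sampling would. If every proposal is accepted, the same pass yields one additional target token.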
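
The accept-or-correct step can be sketched in a few lines. This is a minimal illustration of the standard speculative sampling rule, not any particular runtime's code; the function name `verify_draft`, the list-of-lists probability format, and the toy inputs are assumptions made for the example.

```python
import random

def verify_draft(draft_tokens, q_probs, p_probs, rng):
    """Accept or correct one run of drafted tokens (illustrative sketch).

    draft_tokens: k proposed token ids, each sampled from the matching q_probs row.
    q_probs: k draft distributions (lists over the vocabulary).
    p_probs: k + 1 target distributions from the single verification pass.
    Returns (accepted_tokens, next_token, first_rejected_position or None).
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            continue
        # Rejected: sample the correction from the residual max(0, p - q).
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        correction = rng.choices(range(len(p)), weights=residual)[0]
        return accepted, correction, i
    # Every proposal accepted: take one extra token from the last target row.
    bonus = rng.choices(range(len(p_probs[-1])), weights=p_probs[-1])[0]
    return accepted, bonus, None

# Toy case: the draft is certain of token 0, the target is certain of token 1.
rng = random.Random(0)
print(verify_draft([0], [[1.0, 0.0, 0.0]], [[0.0, 1.0, 0.0]] * 2, rng))  # ([], 1, 0)
```

Sampling the correction from the residual rather than from the raw target distribution is what makes the combined accept/reject procedure reproduce the target distribution.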

Implementation

Coordinate draft and target KV state, tokenization, sampling, and rollback. Validate mathematical distribution preservation or document an approximation.
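
Commit and rollback bookkeeping might look like the sketch below. It uses a plain Python list as a stand-in for per-position KV entries; the class name `SpeculativeCache` and its methods are hypothetical, and a real cache would manage device memory blocks rather than list items.

```python
class SpeculativeCache:
    """Toy stand-in for a KV cache: one entry per token position."""

    def __init__(self):
        self.entries = []    # committed entries followed by tentative ones
        self.committed = 0   # number of entries the target has confirmed

    def append_tentative(self, entry):
        """Add an entry for a drafted position that is not yet verified."""
        self.entries.append(entry)

    def commit(self, n_accepted):
        """Keep the first n_accepted tentative entries and drop the rest."""
        tentative = len(self.entries) - self.committed
        if not 0 <= n_accepted <= tentative:
            raise ValueError("n_accepted is outside the tentative range")
        self.committed += n_accepted
        del self.entries[self.committed:]   # roll back rejected positions

    def rollback(self):
        """Discard all tentative entries, for example on cancellation."""
        del self.entries[self.committed:]


cache = SpeculativeCache()
for entry in ("kv_a", "kv_b", "kv_c"):
    cache.append_tentative(entry)
cache.commit(2)   # two proposals accepted, the third is rolled back
```

Keeping a single committed counter makes the invariant easy to assert in tests: after any commit or rollback, the cache length equals the committed count.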

Operational implications

Cache bookkeeping and draft scheduling can erase the theoretical gain.

Measure

Draft tokens, accepted tokens/pass, rejection position, draft/verify time, and rollback.
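
These counters can be derived from one small record per verification pass. The sketch below assumes chain-style drafts and a hypothetical record format of `(proposed, accepted)` pairs; timing fields are left out for brevity.

```python
from collections import Counter

def summarize_passes(passes):
    """Summarize (proposed, accepted) counts, one pair per target pass."""
    proposed = sum(p for p, _ in passes)
    accepted = sum(a for _, a in passes)
    # With chain drafts, the first rejected position equals the accepted
    # count whenever fewer tokens were accepted than proposed.
    rejections = Counter(a for p, a in passes if a < p)
    return {
        "acceptance_rate": accepted / proposed if proposed else 0.0,
        "accepted_per_pass": accepted / len(passes) if passes else 0.0,
        "rejection_positions": dict(rejections),
    }

summary = summarize_passes([(4, 4), (4, 1), (4, 1), (4, 0)])
```

A rejection histogram concentrated at position 0 suggests the speculative length is too long for the workload, which an average acceptance rate alone would hide.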

External draft models

A smaller compatible model proposes tokens for a larger target.

Implementation

Select a tokenizer-compatible, much cheaper draft and decide whether it shares hardware or uses separate resources.

Operational implications

It adds weight memory, cache, model lifecycle, and potentially distributed communication.
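
A rough footprint estimate helps size that cost before deployment. The sketch below assumes a standard transformer draft with per-layer key and value tensors; the function names and the 2-byte default are illustrative assumptions, and allocator overhead and activations are ignored.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """Approximate KV cache size for a given number of cached tokens."""
    # Keys and values: 2 tensors per layer, each kv_heads * head_dim per token.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value

def draft_footprint_bytes(param_count, bytes_per_param, layers, kv_heads,
                          head_dim, cached_tokens):
    """Draft weights plus draft KV cache, ignoring runtime overhead."""
    weights = param_count * bytes_per_param
    return weights + kv_cache_bytes(layers, kv_heads, head_dim, cached_tokens)
```

Dividing the result by the target's per-request cache size gives a first-order estimate of how many concurrent requests the draft displaces.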

Measure

Draft memory, acceptance by task, draft latency, target passes saved, and Goodput.

EAGLE-style drafts

A learned draft module uses target hidden-state information to improve proposal alignment.

Implementation

Package the draft module and target as a compatible versioned pair; validate hidden-state and cache interfaces.

Operational implications

Model-specific training and runtime integration increase operational coupling.

Measure

Acceptance, module latency, hidden-state transfer, memory, and quality.

Medusa and multi-head trees

Multiple prediction heads propose a tree of future tokens from the target representation.

Implementation

Build tree verification and commit logic; cap branching according to hardware and workload.
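
For greedy decoding, tree verification reduces to walking the candidate tree along the target's own choices. The sketch below is an assumption-laden illustration: `target_next` is a hypothetical callable standing in for logits that a real system would read from one pass over the whole tree with a tree attention mask, and the nested-dict tree format is chosen only for readability.

```python
def longest_accepted_path(tree, target_next, prefix=()):
    """Walk a candidate tree under greedy decoding.

    tree: nested dict mapping token -> subtree of continuations.
    target_next: hypothetical callable, prefix tuple -> target's greedy token.
    Returns (accepted_tokens, next_token); next_token is the target's own
    choice where no candidate matched or the tree ran out.
    """
    accepted = []
    node = tree
    path = tuple(prefix)
    while True:
        choice = target_next(path)
        if choice not in node:
            return accepted, choice
        accepted.append(choice)
        path += (choice,)
        node = node[choice]

def count_candidates(tree):
    """Number of candidate tokens the target must score for this tree."""
    return sum(1 + count_candidates(sub) for sub in tree.values())

# Toy tree with five candidates; the target would emit 1, 3, 4, 9.
tree = {1: {2: {}, 3: {4: {}}}, 5: {}}
target_sequence = [1, 3, 4, 9]
result = longest_accepted_path(tree, lambda path: target_sequence[len(path)])
```

Comparing `count_candidates` with the accepted path length shows how much verification work a given branching factor buys.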

Operational implications

Tree breadth increases verification work and implementation complexity.

Measure

Accepted path length, candidates verified, head overhead, TPOT, and memory.

Native multi-token prediction

Some models are trained with auxiliary heads or objectives to predict multiple future tokens.

Implementation

Use only with the matching model and runtime implementation, and document which models are supported.

Operational implications

It can simplify deployment but is not available for arbitrary checkpoints.

Measure

Head cost, acceptance/usable tokens, target-pass reduction, and Goodput.

When speculation helps

The best fit is memory-bound decode where each target pass is expensive, the draft is much cheaper, and acceptance is consistently strong.
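
The trade-off can be estimated with the idealized model from Leviathan et al. The sketch below assumes each drafted token is accepted independently with probability `alpha`, that verifying `gamma` drafted tokens costs the same as one ordinary target pass, and that `cost_ratio` is the draft step time divided by the target pass time. Real workloads violate all three assumptions to some degree, so treat the result as an upper-level sanity check rather than a prediction.

```python
def expected_tokens_per_pass(alpha, gamma):
    """Expected tokens emitted per target pass with gamma drafted tokens."""
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def expected_speedup(alpha, gamma, cost_ratio):
    """Idealized wall-clock speedup over ordinary decode."""
    return expected_tokens_per_pass(alpha, gamma) / (gamma * cost_ratio + 1.0)
```

With `alpha = 0.8`, `gamma = 4`, and `cost_ratio = 0.05` the estimate is about 2.8x. With `alpha = 0.3` and `cost_ratio = 0.2` it falls to about 0.79x, a net slowdown.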

Implementation

Segment tests by task, language, sampling, prompt type, concurrency, and cache state.

Operational implications

At high concurrency the target may already use parallel compute efficiently; draft work can reduce total capacity.

Measure

Speedup/Goodput by concurrency, acceptance distribution, memory, and tail latency.

Constraints and adapters

Grammar constraints, sampling, adapters, quantization, and distributed placement can alter proposal alignment.

Implementation

Apply compatible masks and sampling logic, measure adapter-specific acceptance, and avoid expensive device transfers.
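
One way to keep masks compatible is to apply the same constraint to the draft and target distributions at each position before the acceptance test. The helper below is a minimal sketch; `apply_mask` and the set-of-token-ids constraint format are assumptions for illustration, and real grammar engines usually operate on logits with bitmasks.

```python
def apply_mask(probs, allowed):
    """Zero out disallowed tokens and renormalize.

    probs: list of probabilities over the vocabulary.
    allowed: set of token ids permitted by the constraint at this position.
    """
    masked = [p if i in allowed else 0.0 for i, p in enumerate(probs)]
    total = sum(masked)
    if total == 0.0:
        raise ValueError("constraint leaves no valid token")
    return [m / total for m in masked]

# Apply the same mask to both sides before computing acceptance.
allowed = {1, 2}
draft_row = apply_mask([0.5, 0.3, 0.2], allowed)
target_row = apply_mask([0.1, 0.6, 0.3], allowed)
```

If only the target is masked, the draft can keep proposing tokens the grammar forbids, and acceptance drops without any visible error.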

Operational implications

A base-model draft can diverge from a fine-tuned target.

Measure

Acceptance by adapter/constraint, transfer, invalid output, and fallback.

Release and fallback

Speculation should be a bounded, observable optimization with ordinary decode available.

Implementation

Verify cache correctness under rejection, cancellation cleanup, quality equivalence, and automatic disable thresholds.
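
Under greedy decoding, quality equivalence has a simple form: speculative output must match ordinary decode token for token. The harness below sketches that check with toy stand-ins; `target_next` and `draft_next` are hypothetical callables mapping a prefix tuple to a token id, and a real test would call the runtime's two decode paths instead.

```python
def plain_greedy(target_next, prompt, n_new):
    """Ordinary decode: one target call per new token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tuple(tokens)))
    return tokens

def speculative_greedy(target_next, draft_next, prompt, n_new, k=4):
    """Greedy speculative decode with k drafted tokens per round."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # Draft k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tuple(tokens + draft)))
        # Verify: keep appending the target's choice while the draft matches.
        for tok in draft:
            expected = target_next(tuple(tokens))
            tokens.append(expected)   # accepted token or correction
            if tok != expected or len(tokens) == goal:
                break
        else:
            tokens.append(target_next(tuple(tokens)))  # bonus token
    return tokens[:goal]

# Toy models: the draft agrees with the target except at some positions.
toy_target = lambda p: (sum(p) * 31 + len(p)) % 50
toy_draft = lambda p: toy_target(p) if len(p) % 3 else 0
assert speculative_greedy(toy_target, toy_draft, [1, 2, 3], 25) == \
    plain_greedy(toy_target, [1, 2, 3], 25)
```

With sampling enabled, exact matching no longer applies and the comparison has to be statistical, for example by comparing token frequencies over many seeded runs.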

Operational implications

Disable when acceptance or Goodput falls below policy or memory pressure threatens concurrency.
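
A disable policy of this kind can be sketched as a rolling-window gate. The class name `SpeculationGate`, the threshold values, and the reason strings are all illustrative assumptions; a production policy would likely add Goodput and hysteresis so speculation does not flap on and off.

```python
from collections import deque

class SpeculationGate:
    """Decide per request whether speculation stays enabled."""

    def __init__(self, min_acceptance=0.5, window=100, min_samples=20):
        self.min_acceptance = min_acceptance
        self.min_samples = min_samples
        self.history = deque(maxlen=window)   # (proposed, accepted) per pass
        self.disable_reason = None

    def record(self, proposed, accepted):
        self.history.append((proposed, accepted))

    def acceptance_rate(self):
        proposed = sum(p for p, _ in self.history)
        accepted = sum(a for _, a in self.history)
        return accepted / proposed if proposed else None

    def should_speculate(self, memory_pressure=False):
        if memory_pressure:
            self.disable_reason = "memory_pressure"
            return False
        rate = self.acceptance_rate()
        enough = len(self.history) >= self.min_samples
        if enough and rate is not None and rate < self.min_acceptance:
            self.disable_reason = "low_acceptance"
            return False
        self.disable_reason = None
        return True
```

Recording the disable reason alongside the decision gives the fallback-rate and disable-reason metrics a single source of truth.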

Measure

Fallback rate, disable reason, quality gate, incident rate, and rollback time.

Reference tables

Speculative decoding patterns
Pattern                       | Proposal source                             | Strength                    | Cost / limitation
External draft model          | Smaller compatible model                    | General and understandable  | Extra model memory and scheduling
EAGLE-style draft             | Learned module using target representations | Potentially high acceptance | Training and model-specific integration
Medusa-style heads            | Multiple target-attached heads              | No full external model      | Tree verification and model modification
Native multi-token prediction | Heads trained with base model               | Integrated path             | Only compatible models
N-gram/suffix methods         | Observed token patterns                     | No draft model              | Workload-dependent acceptance
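
The n-gram row is the only pattern in the table without a dedicated section, so a short sketch may help. The function below proposes draft tokens by finding the most recent earlier occurrence of the trailing n-gram in the context and copying what followed it. The name `ngram_propose` and its defaults are assumptions; this is not any specific runtime's implementation.

```python
def ngram_propose(tokens, ngram_size=3, max_draft=4):
    """Propose draft tokens by matching the trailing n-gram in the context."""
    if len(tokens) <= ngram_size:
        return []
    suffix = tokens[-ngram_size:]
    # Search backwards for the most recent earlier occurrence of the suffix.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == suffix:
            follow = start + ngram_size
            return tokens[follow:follow + max_draft]
    return []

print(ngram_propose([1, 2, 3, 4, 5, 1, 2, 3]))  # [4, 5, 1, 2]
```

Proposals still go through ordinary target verification, so a poor match costs time but does not change the output.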

Decision checklist

  1. Is the workload decode-bound and does the device have verification headroom?
  2. What acceptance distribution is expected by task and sampling policy?
  3. How much memory do draft state and caches consume?
  4. How are cache commit and rollback implemented?
  5. Are structured outputs, adapters, and distributed placement supported?
  6. What ordinary-decode fallback exists?
  7. How is target-quality equivalence demonstrated?

Common mistakes

  • Quoting best-case speedup without acceptance and workload details.
  • Assuming tokenizer compatibility guarantees good draft alignment.
  • Ignoring concurrency lost to draft memory.
  • Testing only greedy decoding when production uses sampling.
  • Failing to verify cache state after rejected proposals.
  • Enabling speculation under saturation where verification adds contention.

Sources and further reading


  1. Speculative decoding
    vLLM · Official documentation · accessed 2026-06-21 UTC

  2. Fast inference via speculative decoding
    Leviathan et al. · Research paper · accessed 2026-06-21 UTC

  3. EAGLE repository
    SafeAILab · Official research repository · accessed 2026-06-21 UTC

  4. Medusa repository
    FasterDecoding · Official research repository · accessed 2026-06-21 UTC

  5. TensorRT-LLM speculative decoding
    NVIDIA · Official documentation · accessed 2026-06-21 UTC

Last reviewed: 2026-06-21 UTC