speculative_num_steps/speculative_num_draft_tokens at runtime instead of keeping a single fixed value for the whole server lifetime.
It is designed for workloads whose accept length changes over time, where one static step count is rarely optimal.
Current support
- Only --speculative-algorithm EAGLE
- Only --speculative-eagle-topk 1
- If either condition is not met, SGLang falls back to static speculative settings
Why adaptive steps help
speculative_num_steps controls how many draft-model autoregressive steps run in each speculative round. In practice, the best value depends on the current workload.
- If num_steps is too small, the draft model could have produced more accepted tokens, but the round stops too early.
- If num_steps is too large, the draft model produces many candidate tokens that the target model rejects, so extra draft work is wasted.
num_steps.
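The trade-off above can be illustrated with a toy model. The sketch below assumes each draft token is accepted independently with a fixed probability p and that acceptance stops at the first rejection (a chain, as with top-k 1). That independence assumption is a simplification introduced here for illustration, not something SGLang measures or guarantees, and the function names are hypothetical.

```python
def expected_accepted(num_steps: int, p: float) -> float:
    """Expected accepted draft tokens when token k survives with probability p**k."""
    return sum(p**k for k in range(1, num_steps + 1))


def expected_wasted(num_steps: int, p: float) -> float:
    """Draft tokens produced but not accepted, under the same toy model."""
    return num_steps - expected_accepted(num_steps, p)


if __name__ == "__main__":
    for p in (0.5, 0.9):
        for n in (1, 3, 7):
            print(
                f"p={p} num_steps={n} "
                f"accepted={expected_accepted(n, p):.2f} "
                f"wasted={expected_wasted(n, p):.2f}"
            )
```

Under this model, a low acceptance probability makes the extra steps mostly wasted, while a high one makes a small num_steps leave accepted tokens on the table.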
Design overview
The adaptive mechanism has three pieces:
- AdaptiveSpeculativeParams: the EMA-based policy
- SpecRuntimeState: the per-tier runtime state bundle
- AdaptiveController: the coordinator that chooses a tier and activates the matching runtime state
candidate_steps = [1, 3, 7].
This matters because CudaGraphRunner is shape-dependent. Each candidate tier owns its own graph and backend state, so runtime switching is a reference swap, not an online graph recapture.
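The "reference swap" idea can be sketched as a dictionary of pre-built states keyed by step count. This is a minimal illustration only: RuntimeStateSketch, ControllerSketch, and the build_state callback are hypothetical names, and the graph field is a placeholder rather than a real CUDA graph or the actual SpecRuntimeState layout.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RuntimeStateSketch:
    """Stand-in for a per-tier bundle (graph, backend state, ...)."""
    num_steps: int
    graph: object  # placeholder for a shape-specific captured graph


class ControllerSketch:
    def __init__(
        self,
        candidate_steps: List[int],
        build_state: Callable[[int], RuntimeStateSketch],
        initial_steps: int,
    ) -> None:
        # Build every tier once, up front; this is where the startup cost goes.
        self.states: Dict[int, RuntimeStateSketch] = {
            n: build_state(n) for n in candidate_steps
        }
        self.active = self.states[initial_steps]

    def activate(self, num_steps: int) -> RuntimeStateSketch:
        # Switching is a dictionary lookup; nothing is rebuilt or recaptured.
        self.active = self.states[num_steps]
        return self.active
```

Because every tier is built ahead of time, each extra entry in candidate_steps adds startup time and memory, which is the cost the tier list trades against flexibility.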
Runtime flow
The adaptive update happens after verify and affects the next round, not the current one.
The tier switch happens after the current round completes. Backends and CUDA graphs are never swapped mid-round.
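The ordering can be shown with a small loop. This sketch uses hypothetical callables (run_round for draft plus verify, choose_next for the policy) and only demonstrates that the step count is fixed for a whole round and that the policy's choice takes effect in the following one.

```python
from typing import Callable, List


def run_rounds(
    run_round: Callable[[int], float],
    choose_next: Callable[[int, float], int],
    initial_steps: int,
    rounds: int,
) -> List[int]:
    """Return the step count actually used in each round."""
    steps = initial_steps
    used: List[int] = []
    for _ in range(rounds):
        used.append(steps)                    # fixed for the whole round
        accepted = run_round(steps)           # draft, then verify
        steps = choose_next(steps, accepted)  # applies to the next round only
    return used
```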
How the policy decides
After each verify pass, SGLang reads the accepted draft length per request, computes the batch average, smooths it with an exponential moving average (EMA), and switches among the pre-built candidate tiers ([1, 3, 7] by default).
The decision logic is intentionally conservative:
- warmup_batches skips the first few batches
- update_interval avoids switching every batch
- down_hysteresis and up_hysteresis reduce oscillation
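The pieces above can be combined into a rough policy sketch. The parameter names and defaults mirror the config keys, but the threshold rule is an assumption made for illustration: this document does not state the exact comparison SGLang uses, so the 0.5 midpoint offset, the starting tier of 3, and the class name AdaptivePolicySketch are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AdaptivePolicySketch:
    candidate_steps: List[int] = field(default_factory=lambda: [1, 3, 7])
    ema_alpha: float = 0.2
    update_interval: int = 5
    warmup_batches: int = 10
    down_hysteresis: float = -0.25
    up_hysteresis: float = 0.0
    current_steps: int = 3          # assumed starting tier
    ema: Optional[float] = None
    batches_seen: int = 0

    def observe(self, accepted_lengths: List[int]) -> int:
        """Feed one verify batch; return the tier to use for the next round."""
        if not accepted_lengths:
            return self.current_steps
        batch_avg = sum(accepted_lengths) / len(accepted_lengths)
        if self.ema is None:
            self.ema = batch_avg
        else:
            a = self.ema_alpha
            self.ema = a * batch_avg + (1 - a) * self.ema
        self.batches_seen += 1

        # Observe only during warmup, then recompute every update_interval batches.
        past_warmup = self.batches_seen - self.warmup_batches
        if past_warmup <= 0 or past_warmup % self.update_interval != 0:
            return self.current_steps

        idx = self.candidate_steps.index(self.current_steps)
        # Assumed rule: compare the EMA against tier midpoints shifted by the margins.
        up_threshold = self.current_steps - 0.5 + self.up_hysteresis
        if idx + 1 < len(self.candidate_steps) and self.ema > up_threshold:
            self.current_steps = self.candidate_steps[idx + 1]
        elif idx > 0:
            down_threshold = self.candidate_steps[idx - 1] - 0.5 + self.down_hysteresis
            if self.ema <= down_threshold:
                self.current_steps = self.candidate_steps[idx - 1]
        return self.current_steps
```

With this rule, a negative down_hysteresis lowers the bar the EMA must fall under before stepping down, and a positive up_hysteresis raises the bar for stepping up, so the two thresholds never coincide and the tier does not flap around a single value.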
Usage
--speculative-adaptive-config is optional, but the speculative setup still needs to be valid for adaptive mode.
--speculative-adaptive-config /path/to/adaptive_spec.json.
Example config:
Config file reference
The config file is optional. Any omitted keys use defaults.

| Key | Default | Meaning |
|---|---|---|
| candidate_steps | [1, 3, 7] | Discrete speculative_num_steps tiers that adaptive mode can switch between |
| ema_alpha | 0.2 | EMA smoothing factor for accepted draft length |
| update_interval | 5 | Recompute interval, in verify batches, after warmup |
| warmup_batches | 10 | Number of verify batches to observe before switching |
| down_hysteresis | -0.25 | Extra margin before moving to a smaller step |
| up_hysteresis | 0.0 | Extra margin before moving to a larger step |
--speculative-num-steps is snapped to the nearest value in candidate_steps.
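The defaulting and snapping behaviour can be sketched in a few lines. The keys and defaults are taken from the table above; the helper names are hypothetical, and the tie-breaking rule (the earlier candidate wins when two tiers are equally close) is an assumption, since this document does not say how ties are resolved.

```python
import json
from typing import List, Optional

# Defaults copied from the config reference table.
DEFAULTS = {
    "candidate_steps": [1, 3, 7],
    "ema_alpha": 0.2,
    "update_interval": 5,
    "warmup_batches": 10,
    "down_hysteresis": -0.25,
    "up_hysteresis": 0.0,
}


def load_adaptive_config(text: Optional[str]) -> dict:
    """Merge a JSON config string over the defaults; omitted keys keep defaults."""
    cfg = dict(DEFAULTS)
    if text:
        cfg.update(json.loads(text))
    return cfg


def snap_to_tier(num_steps: int, candidate_steps: List[int]) -> int:
    """Pick the candidate closest to num_steps (first candidate wins a tie)."""
    return min(candidate_steps, key=lambda c: abs(c - num_steps))


if __name__ == "__main__":
    cfg = load_adaptive_config('{"candidate_steps": [1, 3], "ema_alpha": 0.5}')
    print(json.dumps(cfg, indent=2))
    print(snap_to_tier(4, DEFAULTS["candidate_steps"]))
```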
Monitoring
You can inspect the active tier and acceptance metric via /server_info:
- speculative_num_steps is the current active tier
- avg_spec_accept_length helps explain whether the server is likely to move up or down
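A small polling helper might look like the sketch below. The base URL is whatever address your server listens on, and because this document does not show the response layout, the helper searches the returned JSON recursively for the two field names instead of assuming where they sit. The function names are hypothetical.

```python
import json
from typing import Any, Optional
from urllib.request import urlopen


def find_first(obj: Any, key: str) -> Optional[Any]:
    """Depth-first search for the first non-null value stored under key."""
    if isinstance(obj, dict):
        if obj.get(key) is not None:
            return obj[key]
        children = list(obj.values())
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_first(child, key)
        if found is not None:
            return found
    return None


def read_adaptive_status(base_url: str) -> dict:
    """Fetch /server_info and pull out the two adaptive-mode fields."""
    with urlopen(base_url.rstrip("/") + "/server_info", timeout=5) as resp:
        info = json.load(resp)
    keys = ("speculative_num_steps", "avg_spec_accept_length")
    return {k: find_first(info, k) for k in keys}
```

Calling read_adaptive_status periodically and logging both values side by side makes it easier to see whether a tier change followed a sustained shift in accept length.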
Tuning tips
- Start with the default candidate tiers [1, 3, 7]
- Use fewer tiers if you want lower startup and graph-memory overhead
- Increase ema_alpha to react faster, or lower it for more stability
- Increase warmup_batches or update_interval if tier switching is too noisy
- If your workload is already stable and one static setting is well tuned, adaptive mode may not help much
