Deployment
Install SGLang
Install SGLang
Laguna-XS-2.1 support is fully merged to SGLang Then run the Python output of the command panel below in that environment.
main (PR #29446: DFlash speculative decoding + shared-expert fix; PR #29761: INT4 loader fix). Any build at or past their merge covers every cell below.The model ships custom config code on the Hub, so --trust-remote-code is required (included in the launch commands).- Python (pip / uv)
- Docker
Command
- Low-latency — DFlash speculative decoding with a matched draft model. Pick for chat and interactive agents.
- High-throughput — plain serving. Best for batch workloads, where speculation’s draft + rejection overhead costs more than it saves.
--tp 8; FP8 and INT4 run --tp 8 --ep-size 8 because their quantization scales cannot shard the MoE 8-way (see Configuration Tips). The 4-GPU GB300 node runs plain --tp 4 throughout.
Playground
The Playground is where you experiment with SGLang features beyond the verified matrix. The Deploy panel above only emits combinations that have been signed off; the Playground lets you turn on additional knobs (TP degree, parsers) on top of whichever cell the Deploy panel is currently showing.1. Model Introduction
Laguna-XS-2.1 is an open-weight 33B-parameter hybrid sliding-window-attention MoE model (~3B active per token) from poolside, built for agentic coding and long-horizon software engineering — the extra-small sibling of Laguna-M.1. Key Features:- Sparse MoE: 40 layers, 256 routed experts, top-8 routing.
- Hybrid attention: 30 sliding-window layers (window 512) interleaved with 10 full-attention layers; 48 Q / 8 KV heads.
- Long context: 262,144 tokens (RoPE + YaRN on the full-attention layers).
- DFlash drafts: matched draft models (5-layer, ~0.9 GB) ship per quantization for low-latency serving.
- Hybrid reasoning:
<think>…</think>toggled per request viachat_template_kwargs={"enable_thinking": …}.
2. Configuration Tips
Attention backend Leave--attention-backend unset for High-throughput cells — auto-select is correct (fa3 on Hopper, trtllm_mha on Blackwell). With DFlash active, auto-select instead falls back to flashinfer, which breaks this hybrid-SWA model at tp ≥ 4 on Blackwell (greedy GSM8K 76% → 28%), so the Low-latency commands pin the target backend explicitly. Leave --speculative-draft-attention-backend unset. Never use triton attention with Laguna (GSM8K 13%).
Quantized checkpoints cap plain TP at 4
moe_intermediate_size=512 with FP8 block [128,128] / INT4 group_size=128 scales cannot shard 8-way (512/8 = 64 < 128 granularity): FP8 fails at weight creation, INT4 crashes in the Marlin kernel, on any hardware. The generated 8-GPU FP8/INT4 commands therefore use --tp 8 --ep-size 8 — expert parallelism keeps whole experts per rank, using all 8 GPUs on one instance. FP8 additionally needs SGLANG_SHARED_EXPERT_TP1=1 (its shared expert is also block-quantized; INT4’s stays bf16). Alternatives: plain --tp 4, or --tp 4 --dp-size 2. Accuracy is parallelism-independent within eval noise (verified tp1 ≡ tp4 on GB300 and tp4 ≡ tp8+ep8 on H200).
DFlash memory
Low-latency cells carry --mem-fraction-static 0.7: the default fraction OOMs in the draft vocab all-gather at tp 4 on GB300. Dense cells use the default heuristic.
INT4 is mixed-precision
The INT4 checkpoint quantizes MoE layers in mixed 4-bit / 8-bit config groups. Builds older than PR #29761 crash at load with KeyError: 'Linear'.
Chat template
On transformers ≥ 5.10 the standalone chat_template.jinja auto-loads — no flag needed (the server logs Auto-detected template features: reasoning_parser=poolside_v1, ...). On older transformers (≤ ~5.8) the {% include %} stub in tokenizer_config.json cannot resolve and the server silently falls back to a generic template — pass --chat-template <model-dir>/chat_template.jinja explicitly there.
Thinking
Off by default; opt in per request with extra_body={"chat_template_kwargs": {"enable_thinking": True}}. The template gates on enable_thinking — the generic thinking key is ignored.
Served model id
The server registers the model under whatever you pass to --model-path; a client’s model field must match it (poolside/Laguna-XS-2.1, or the -FP8 / -NVFP4 / -INT4 id).
3. Advanced Usage
3.1 DFlash Speculative Decoding
DFlash is a block-wise speculative decoder: the 5-layer draft proposes a block of tokens and the target verifies the whole block in one forward pass, so only target-approved tokens are emitted — output quality is the target’s by construction (GSM8K matches dense within noise on every quantization). The speedup lever is accept-length, the number of draft tokens surviving verification per target step:- Measured ~6 tokens/step at
tp 1, ~4 attp 4(greedy GSM8K, matched-precision pairs; ~3 under mixed reasoning-heavy traffic; FP8 reached 6.75 attp 8 + ep 8on H200) — versus 1 token/step dense. - Best for interactive / few-stream serving. Under batch-saturated load prefer High-throughput: once the GPU is compute-bound, draft + rejected-token overhead costs aggregate throughput.
- The generated commands always pair the draft calibrated for the selected target precision.
3.2 Reasoning
Launch with--reasoning-parser poolside_v1 (baked into every generated command). Reasoning is opt-in via enable_thinking=True; the <think> trace lands in message.reasoning_content, separate from the final answer in message.content.
Reasoning Example (Python)
Reasoning Example (Python)
Example
XS-2.1 is an extra-small model — give it generous
max_tokens when thinking is enabled
(hard problems regularly reason for thousands of tokens), and keep thinking off for
short-form tasks.3.3 Tool Calling
Launch with--tool-call-parser poolside_v1 (baked into every generated command). The parser converts Laguna’s <tool_call> output into the standard OpenAI tool_calls structure. Tool calling works with reasoning off (the default).
Tool Calling Example (Python)
Tool Calling Example (Python)
Example
