
Deployment

Laguna-XS-2.1 support is fully merged to SGLang main (PR #29446: DFlash speculative decoding + shared-expert fix; PR #29761: INT4 loader fix). Any build at or past their merge covers every cell below. The model ships custom config code on the Hub, so --trust-remote-code is required (included in the launch commands).
Command
pip install -U uv
uv venv --python 3.12 && source .venv/bin/activate

git clone https://github.com/sgl-project/sglang.git
cd sglang
uv pip install -e python
Then run the Python output of the command panel below in that environment.
Pick your hardware + quantization + strategy to generate the launch command. The two serving strategies cover the common operating points:
  • Low-latency — DFlash speculative decoding with a matched draft model. Pick for chat and interactive agents.
  • High-throughput — plain serving. Best for batch workloads, where speculation’s draft + rejection overhead costs more than it saves.
On the 8-GPU HGX platforms (H200 / B300), BF16 and NVFP4 run plain --tp 8; FP8 and INT4 run --tp 8 --ep-size 8 because their quantization scales cannot shard the MoE 8-way (see Configuration Tips). The 4-GPU GB300 node runs plain --tp 4 throughout.
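The parallelism rule in the paragraph above can be sketched as a small lookup. This is only an illustration of the stated rule: the helper name parallel_flags and the platform keys are hypothetical and not part of SGLang, and the generated launch commands carry further flags that are not reproduced here.

```python
# Hypothetical helper (not an SGLang API): maps a platform/precision pair to
# the parallelism flags described in the text above.
VERIFIED_PRECISIONS = {"BF16", "FP8", "NVFP4", "INT4"}


def parallel_flags(platform: str, precision: str) -> list[str]:
    platform, precision = platform.upper(), precision.upper()
    if precision not in VERIFIED_PRECISIONS:
        raise ValueError(f"unknown precision: {precision}")
    if platform == "GB300":
        # 4-GPU node: plain tensor parallelism for every precision.
        return ["--tp", "4"]
    if platform in ("H200", "B300"):
        if precision in ("FP8", "INT4"):
            # Quantization scales cannot shard the MoE 8-way, so expert
            # parallelism is added on top of --tp 8.
            return ["--tp", "8", "--ep-size", "8"]
        # BF16 and NVFP4 run plain 8-way tensor parallelism.
        return ["--tp", "8"]
    raise ValueError(f"platform not covered by the text: {platform}")


print(" ".join(parallel_flags("H200", "FP8")))
print(" ".join(parallel_flags("GB300", "INT4")))
```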

Playground

The Playground is where you experiment with SGLang features beyond the verified matrix. The Deploy panel above only emits combinations that have been signed off; the Playground lets you turn on additional knobs (TP degree, parsers) on top of whichever cell the Deploy panel is currently showing.

1. Model Introduction

Laguna-XS-2.1 is an open-weight 33B-parameter hybrid sliding-window-attention MoE model (~3B active per token) from poolside, built for agentic coding and long-horizon software engineering. It is the extra-small sibling of Laguna-M.1.
Key Features:
  • Sparse MoE: 40 layers, 256 routed experts, top-8 routing.
  • Hybrid attention: 30 sliding-window layers (window 512) interleaved with 10 full-attention layers; 48 Q / 8 KV heads.
  • Long context: 262,144 tokens (RoPE + YaRN on the full-attention layers).
  • DFlash drafts: matched draft models (5-layer, ~0.9 GB) ship per quantization for low-latency serving.
  • Hybrid reasoning: <think>…</think> toggled per request via chat_template_kwargs={"enable_thinking": …}.
Available quantizations:
Precision | Target model | Draft model
BF16 | poolside/Laguna-XS-2.1 | poolside/Laguna-XS-2.1-DFlash
FP8 | poolside/Laguna-XS-2.1-FP8 | poolside/Laguna-XS-2.1-DFlash-FP8
NVFP4 | poolside/Laguna-XS-2.1-NVFP4 | poolside/Laguna-XS-2.1-DFlash-NVFP4
INT4 | poolside/Laguna-XS-2.1-INT4 | poolside/Laguna-XS-2.1-DFlash-INT4
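If you assemble launch commands by hand, the table above can be kept as a lookup so a target is never combined with another precision's draft. A minimal sketch: the model ids are copied from the table, while the names MATCHED_DRAFT and draft_for are illustrative only.

```python
# Target -> matched draft, copied from the quantization table above.
MATCHED_DRAFT = {
    "poolside/Laguna-XS-2.1": "poolside/Laguna-XS-2.1-DFlash",
    "poolside/Laguna-XS-2.1-FP8": "poolside/Laguna-XS-2.1-DFlash-FP8",
    "poolside/Laguna-XS-2.1-NVFP4": "poolside/Laguna-XS-2.1-DFlash-NVFP4",
    "poolside/Laguna-XS-2.1-INT4": "poolside/Laguna-XS-2.1-DFlash-INT4",
}


def draft_for(target: str) -> str:
    """Return the draft model listed for a target; reject ids not in the table."""
    try:
        return MATCHED_DRAFT[target]
    except KeyError:
        raise ValueError(f"no matched draft listed for {target!r}") from None


print(draft_for("poolside/Laguna-XS-2.1-FP8"))
```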
The drafts themselves are small bf16 models, each calibrated against its quantized target. Always pair a target with its matched draft (mixing precisions degrades accept-length).
License: Apache 2.0
Resources: Hugging Face · Release blog post · API platform.

2. Configuration Tips

  • Attention backend: Leave --attention-backend unset for High-throughput cells; auto-select is correct (fa3 on Hopper, trtllm_mha on Blackwell). With DFlash active, auto-select instead falls back to flashinfer, which breaks this hybrid-SWA model at tp ≥ 4 on Blackwell (greedy GSM8K 76% → 28%), so the Low-latency commands pin the target backend explicitly. Leave --speculative-draft-attention-backend unset. Never use triton attention with Laguna (GSM8K 13%).
  • Quantized checkpoints cap plain TP at 4: With moe_intermediate_size=512, the FP8 block [128,128] and INT4 group_size=128 scales cannot shard 8-way (512/8 = 64 < 128 granularity): FP8 fails at weight creation and INT4 crashes in the Marlin kernel, on any hardware. The generated 8-GPU FP8/INT4 commands therefore use --tp 8 --ep-size 8; expert parallelism keeps whole experts per rank, so all 8 GPUs serve one instance. FP8 additionally needs SGLANG_SHARED_EXPERT_TP1=1 (its shared expert is also block-quantized; INT4’s stays bf16). Alternatives: plain --tp 4, or --tp 4 --dp-size 2. Accuracy is parallelism-independent within eval noise (verified tp1 ≡ tp4 on GB300 and tp4 ≡ tp8+ep8 on H200).
  • DFlash memory: Low-latency cells carry --mem-fraction-static 0.7 because the default fraction OOMs in the draft vocab all-gather at tp 4 on GB300. Dense cells use the default heuristic.
  • INT4 is mixed-precision: The INT4 checkpoint quantizes MoE layers in mixed 4-bit / 8-bit config groups. Builds older than PR #29761 crash at load with KeyError: 'Linear'.
  • Chat template: On transformers ≥ 5.10 the standalone chat_template.jinja auto-loads, so no flag is needed (the server logs Auto-detected template features: reasoning_parser=poolside_v1, ...). On older transformers (≤ ~5.8) the {% include %} stub in tokenizer_config.json cannot resolve and the server silently falls back to a generic template; pass --chat-template <model-dir>/chat_template.jinja explicitly there.
  • Thinking: Off by default; opt in per request with extra_body={"chat_template_kwargs": {"enable_thinking": True}}. The template gates on enable_thinking; the generic thinking key is ignored.
  • Served model id: The server registers the model under whatever you pass to --model-path; a client’s model field must match it (poolside/Laguna-XS-2.1, or the -FP8 / -NVFP4 / -INT4 id).
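The TP cap for quantized checkpoints comes down to simple arithmetic: each rank's slice of the 512-wide expert intermediate dimension has to cover at least one 128-wide scale block or group. The check below is a sketch of that arithmetic only. The function name is hypothetical, and the extra requirement that the slice be a whole multiple of the granularity is an assumption on top of the 64 < 128 condition stated above.

```python
def shards_cleanly(moe_intermediate_size: int, tp: int, granularity: int) -> bool:
    """Can the expert intermediate dim be split tp ways without cutting a scale block?"""
    if moe_intermediate_size % tp != 0:
        return False
    per_rank = moe_intermediate_size // tp
    # Condition from the text: the per-rank slice must not be smaller than one
    # scale block/group. Assumption: it must also be a whole multiple of it.
    return per_rank >= granularity and per_rank % granularity == 0


# moe_intermediate_size=512 with FP8 block 128 / INT4 group_size 128.
for tp in (1, 2, 4, 8):
    print(f"tp={tp}: slice={512 // tp}, ok={shards_cleanly(512, tp, 128)}")
```

With these numbers the check passes for tp 1, 2, and 4 and fails at tp 8, which is the case the 8-GPU FP8/INT4 commands avoid by adding --ep-size 8.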

3. Advanced Usage

3.1 DFlash Speculative Decoding

DFlash is a block-wise speculative decoder: the 5-layer draft proposes a block of tokens and the target verifies the whole block in one forward pass, so only target-approved tokens are emitted — output quality is the target’s by construction (GSM8K matches dense within noise on every quantization). The speedup lever is accept-length, the number of draft tokens surviving verification per target step:
  • Measured ~6 tokens/step at tp 1, ~4 at tp 4 (greedy GSM8K, matched-precision pairs; ~3 under mixed reasoning-heavy traffic; FP8 reached 6.75 at tp 8 + ep 8 on H200) — versus 1 token/step dense.
  • Best for interactive / few-stream serving. Under batch-saturated load prefer High-throughput: once the GPU is compute-bound, draft + rejected-token overhead costs aggregate throughput.
  • The generated commands always pair the draft calibrated for the selected target precision.
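To make accept-length concrete, here is a toy version of greedy block verification. It is a generic speculative-decoding sketch, not DFlash's or SGLang's implementation: the function name and token values are made up, and it assumes the target always contributes one token of its own per step (the correction at the first disagreement, or a bonus token after a fully accepted block). Whether the accept-length figures above count that extra token is not stated here.

```python
def verify_block(draft_block: list[int], target_choices: list[int]) -> list[int]:
    """Return the tokens emitted for one target step under greedy verification.

    draft_block: tokens proposed by the draft model.
    target_choices: the target's greedy token at each of the
        len(draft_block) + 1 positions scored in the single verification pass.
    """
    if len(target_choices) != len(draft_block) + 1:
        raise ValueError("target must score one more position than the draft block")
    emitted = []
    for proposed, wanted in zip(draft_block, target_choices):
        if proposed != wanted:
            break  # first disagreement: the rest of the block is discarded
        emitted.append(proposed)
    # The target's own choice at the stopping position is always emitted, so
    # the output is exactly what the target alone would have produced.
    emitted.append(target_choices[len(emitted)])
    return emitted


# Two draft tokens survive, then the target overrides the third.
print(verify_block([5, 6, 7, 8], [5, 6, 9, 1, 2]))
```

Every emitted token is one the target chose itself, which is why output quality does not depend on the draft; a weaker draft only shortens the accepted prefix.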

3.2 Reasoning

Launch with --reasoning-parser poolside_v1 (baked into every generated command). Reasoning is opt-in via enable_thinking=True; the <think> trace lands in message.reasoning_content, separate from the final answer in message.content.
Example
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="poolside/Laguna-XS-2.1",
    messages=[{"role": "user", "content": "What is 15% of 240? Explain briefly."}],
    max_tokens=2048,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

message = response.choices[0].message
print("=============== Reasoning ===============")
print(message.reasoning_content)
print("=============== Answer ==================")
print(message.content)
XS-2.1 is an extra-small model — give it generous max_tokens when thinking is enabled (hard problems regularly reason for thousands of tokens), and keep thinking off for short-form tasks.
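One way to apply that advice is to build request arguments from a single thinking switch. This is a sketch under stated assumptions: the helper name chat_request is hypothetical, and the token budgets are illustrative (2048 mirrors the example above, 256 is an arbitrary short-form cap), not recommended values.

```python
def chat_request(prompt: str, thinking: bool,
                 model: str = "poolside/Laguna-XS-2.1") -> dict:
    """Build keyword arguments for client.chat.completions.create."""
    request = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Illustrative budgets only; tune them for your workload.
        "max_tokens": 2048 if thinking else 256,
    }
    if thinking:
        # The template gates on enable_thinking; leaving it out keeps the
        # default (thinking off).
        request["extra_body"] = {"chat_template_kwargs": {"enable_thinking": True}}
    return request


print(chat_request("What is 15% of 240?", thinking=False))
```

Pass the result as client.chat.completions.create(**chat_request(prompt, thinking=True)).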

3.3 Tool Calling

Launch with --tool-call-parser poolside_v1 (baked into every generated command). The parser converts Laguna’s <tool_call> output into the standard OpenAI tool_calls structure. Tool calling works with reasoning off (the default).
Example
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "The city name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="poolside/Laguna-XS-2.1",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:
    for call in message.tool_calls:
        print(f"Tool: {call.function.name}")
        print(f"Args: {call.function.arguments}")
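The example above stops once the tool call is printed. In the usual OpenAI-style flow the client then runs the tool and sends the result back as a tool message. The sketch below covers only that local step: get_weather is a stand-in that returns canned data, and TOOL_REGISTRY and run_tool_call are hypothetical names, not part of SGLang or the OpenAI SDK.

```python
import json


def get_weather(location: str, unit: str = "celsius") -> dict:
    # Stand-in with canned data; a real tool would query a weather service.
    return {"location": location, "unit": unit, "temperature": 21}


# Hypothetical registry mapping tool names to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(call_id: str, name: str, arguments: str) -> dict:
    """Execute one tool call and wrap the result as an OpenAI-style tool message."""
    handler = TOOL_REGISTRY[name]
    # call.function.arguments arrives as a JSON string, so decode it first.
    result = handler(**json.loads(arguments))
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}


reply = run_tool_call("call_0", "get_weather", '{"location": "Beijing"}')
print(reply["content"])
```

To continue the conversation, append the assistant message that contained the tool call and then this tool message to messages, and call client.chat.completions.create again with the same tools.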