Documentation Index
Fetch the complete documentation index at: https://docs.sglang.io/llms.txt
Use this file to discover all available pages before exploring further.
1. Model Introduction
Laguna-XS.2 is an open-source hybrid sliding-window-attention MoE model from Poolside, built for agentic coding and long-horizon software engineering work.
Key Features:
- MoE: 33.4B total parameters, 3.0B active per token, 256 routed experts (top-8) plus 1 shared expert.
- Long context: 131,072 tokens.
- Agentic coding: Tuned for tool-using software engineering agents and long-horizon execution.
- Hybrid reasoning: `<think>...</think>` segments toggled per request via `chat_template_kwargs={"enable_thinking": ...}`.
| Variant | Hugging Face path |
|---|---|
| BF16 | poolside/Laguna-XS.2 |
| FP8 | poolside/Laguna-XS.2-FP8 |
| NVFP4 | poolside/Laguna-XS.2-NVFP4 |
2. SGLang Installation
Laguna-XS.2 support is on `main` but not yet in a tagged release; install from the SGLang nightly wheel index, or pull a pre-built Docker image:
Command
3. Model Deployment
3.1 Basic Configuration
Interactive Command Generator: Use the configuration selector below to generate a launch command for your hardware.
3.2 Configuration Tips
- Quantization: NVFP4 requires Blackwell (B200 / B300); BF16 and FP8 run on either H200 or B200. FP8's first launch triggers a multi-session DeepGEMM JIT pre-compile (~10-20 min); pre-warm with `python3 -m sglang.compile_deep_gemm --model poolside/Laguna-XS.2-FP8` to avoid that cost on every restart.
- Reasoning parser (`--reasoning-parser poolside_v1`): Splits `<think>...</think>` segments into `reasoning_content` so `content` holds only the final answer. Disable only if you want the raw `<think>` tags in `content`.
- Tool call parser (`--tool-call-parser poolside_v1`): Required for OpenAI-compatible tool-call streaming. Disable only for chat-only deployments.
- DP attention: For higher-throughput deployments, enable the DP-Attention toggle, which emits `--dp <N> --enable-dp-attention` with `--dp` matching `--tp` (tune independently if needed).
- Thinking default: Thinking is off by default at the model level. Opt in per request with `extra_body={"chat_template_kwargs": {"enable_thinking": True}}`.
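The per-request thinking toggle above can be sketched as a plain request body. This is a minimal illustration using only the keys named in this page (`chat_template_kwargs`, `enable_thinking`); the message text is a placeholder, and with the official OpenAI Python client these extra keys would travel via `extra_body`:

```python
import json

# Base OpenAI-style chat request; thinking is off by default at the model level.
payload = {
    "model": "poolside/Laguna-XS.2",
    "messages": [{"role": "user", "content": "Refactor this function."}],
    "temperature": 0.6,
}

# Per-request opt-in: add the chat_template_kwargs the server's chat template reads.
thinking_payload = {**payload, "chat_template_kwargs": {"enable_thinking": True}}

print(json.dumps(thinking_payload["chat_template_kwargs"]))  # {"enable_thinking": true}
```

Leaving `chat_template_kwargs` out entirely keeps the model-level default (thinking off), so only requests that need reasoning pay for it.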
4. Model Invocation
The samples below assume the server is reachable at `http://localhost:30000/v1`.
4.1 Basic Chat
Example
Output
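As a minimal sketch of what a basic chat request looks like on the wire: the body below targets the OpenAI-compatible `POST /v1/chat/completions` endpoint assumed above, and the response dict is a hypothetical illustration of the OpenAI response shape, not captured server output.

```python
import json

# Request body for POST http://localhost:30000/v1/chat/completions
# (OpenAI-compatible schema; model name from the variant table above).
request = {
    "model": "poolside/Laguna-XS.2",
    "messages": [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a one-line docstring for a sort function."},
    ],
    "max_tokens": 256,
}
body = json.dumps(request).encode()

# Hypothetical response in the OpenAI-compatible shape (illustrative only).
response = {
    "choices": [
        {"message": {"role": "assistant", "content": "Sorts items in ascending order."}}
    ]
}
answer = response["choices"][0]["message"]["content"]
print(answer)
```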
4.2 Reasoning (Thinking Mode)
Laguna-XS.2 emits reasoning between `<think>...</think>` tags. The `--reasoning-parser poolside_v1` flag separates the thinking text into `reasoning_content` so `content` holds only the final answer. Thinking is opt-in per request:
Example
Output
To turn thinking off again, omit the `extra_body` (it is off by default) or pass `chat_template_kwargs={"enable_thinking": False}` explicitly.
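The split the reasoning parser produces can be illustrated with a hypothetical response message (placeholder text, not captured model output): `reasoning_content` holds the thinking text and `content` holds only the final answer, with no raw think tags.

```python
# Illustrative assistant message when --reasoning-parser poolside_v1 is active and
# the request opted in with extra_body={"chat_template_kwargs": {"enable_thinking": True}}.
# The string values are placeholders, not real model output.
message = {
    "role": "assistant",
    "reasoning_content": "The user wants X; compare approaches A and B first.",
    "content": "Use approach B; it avoids the extra allocation.",
}

# The parser has already stripped the <think>...</think> wrapper, so the final
# answer in `content` carries no raw think tags.
assert "<think>" not in message["content"]
print(message["reasoning_content"] is not None)  # True when thinking was enabled
```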
4.3 Tool Calling
Example
Output
`reasoning_content` is `None` because thinking is off by default; `content` carries the brief assistant message that precedes the tool call. Add `extra_body={"chat_template_kwargs": {"enable_thinking": True}}` if you want interleaved reasoning before the tool call.
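A tool-calling round trip can be sketched as follows. The tool schema is the standard OpenAI function format; the function name `run_tests` and the response message are hypothetical illustrations of the shape `--tool-call-parser poolside_v1` produces, not captured output.

```python
import json

# OpenAI-style tool schema; the function name and parameters are hypothetical.
tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return pass/fail counts.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

# Illustrative assistant message (placeholder values): reasoning_content is None
# because thinking was left off by default, and the brief content precedes the call.
message = {
    "role": "assistant",
    "reasoning_content": None,
    "content": "Running the test suite now.",
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {"name": "run_tests", "arguments": "{\"path\": \"tests/\"}"},
    }],
}

call = message["tool_calls"][0]["function"]
args = json.loads(call["arguments"])  # arguments arrive as a JSON string
print(call["name"], args["path"])
```

Note that `function.arguments` is a JSON-encoded string in the OpenAI schema, so it must be parsed before dispatching to the tool.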
5. Benchmark
5.1 Accuracy Benchmark
Test Environment:
- Hardware: NVIDIA H200 (4×H200)
- Model: `poolside/Laguna-XS.2` (BF16)
- Tensor Parallelism: 4
- SGLang Version: `0.5.12.dev20260509+g096ad02b0` (nightly wheel containing the #24204 merge commit; same code path as the original PR runs)
- Reasoning Parser: `poolside_v1`
- Tool Call Parser: `poolside_v1`
- Sampling: `temperature=0.6`, `max_tokens=16384`, `chat_template_kwargs={"enable_thinking": true}`, `n_repeats=1`
- Grader: NeMo-Skills `math_verify` (math) and `eval_mcq` (multichoice)
| Eval | Accuracy |
|---|---|
| GPQA Diamond | 0.5556 |
| AIME 25 | 0.5667 |
| MMLU | 0.836 |
| SWE-Bench Verified | 0.6540 |
5.2 Speed Benchmark
Test Environment:
- Hardware: NVIDIA H200 (1×H200 for TP=1, 4×H200 for TP=4)
- Model: `poolside/Laguna-XS.2` (BF16)
- SGLang Version: `0.5.12.dev20260509+g096ad02b0` (nightly wheel containing the #24204 merge commit; same code path as the original PR runs)
- Workload: `sglang.bench_serving --backend sglang --dataset-name random` (defaults: `--random-input-len 1024 --random-output-len 1024 --random-range-ratio 0.0`)
- Server flags identical to the accuracy runs above.
5.2.1 Latency Benchmark (10 prompts, concurrency = 1)
Command
| Metric | TP=1 | TP=4 |
|---|---|---|
| Successful requests | 10 | 10 |
| Output token throughput (tok/s) | 193.10 | 238.88 |
| Total token throughput (tok/s) | 471.82 | 583.68 |
| Mean TTFT (ms) | 35.32 | 24.17 |
| Mean TPOT (ms) | 5.10 | 4.13 |
| Median ITL (ms) | 5.14 | 4.14 |
5.2.2 Throughput Benchmark (1000 prompts, concurrency = 100)
Command
| Metric | TP=1 | TP=4 |
|---|---|---|
| Successful requests | 1000 | 1000 |
| Request throughput (req/s) | 7.32 | 14.61 |
| Output token throughput (tok/s) | 3739.30 | 7465.18 |
| Peak output token throughput (tok/s) | 4718.00 | 10133.00 |
| Total token throughput (tok/s) | 7485.82 | 14944.81 |
| Mean TTFT (ms) | 115.17 | 68.36 |
| Mean TPOT (ms) | 25.51 | 12.71 |
| Median ITL (ms) | 21.31 | 10.64 |
All figures in the table above are from the cc = 100 random workload.