Deployment
Install SGLang
Install SGLang
Laguna-M.1 support is already on SGLang Then run the Python output of the command panel below in that environment. The Docker tab is simpler — its image (
main — softplus per-element attention-output gating (PR #28400) and a global-attention fix (PR #28604, since M.1 is full-attention sliding_window: 0) — but not yet in a tagged release. The two paths below match the Python / Docker toggle in the command panel: install from main (Python tab), or use the Docker image, which bundles the same build (CUDA 13, covers H200 + all Blackwell). The model loads natively, so no --trust-remote-code is needed.- Python (pip / uv)
- Docker
Command
dev-cu13-618-nightly) bundles the CUDA-13 runtime and the M.1 code. Once M.1 support lands in a tagged release, uv pip install sglang will pull it directly.--tp 8; the 4-GPU Grace-Blackwell single nodes (GB200 / GB300) use --tp 4.
Playground
The Playground is where you experiment with SGLang features beyond the verified matrix. The Deploy panel above only emits combinations the SGLang team has signed off on; the Playground lets you turn on additional knobs (parsers, DP-Attention, DeepEP / EP) on top of whichever cell the Deploy panel is currently showing.1. Model Introduction
Laguna-M.1 is an open-weight, 225B-parameter Mixture-of-Experts model (23B activated per token) from poolside, built for agentic coding and long-horizon software-engineering work. It is released under Apache 2.0. Key Features:- Large sparse MoE: 70-layer transformer — the first 3 layers are dense SwiGLU, the remaining 67 are sparse MoE with 256 experts, top-16 routing (+1 shared expert) and auxiliary-loss-free load balancing.
- Global attention with output gating: global attention across all layers, 64 Q-heads / 8 KV-heads (head dim 128), with softplus attention output gating (requires PR #28400).
- Long context: 262,144 tokens, RoPE with YaRN.
- Agentic coding: competitive on SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0.
- Native reasoning: interleaved thinking between tool calls, toggled per request via
chat_template_kwargs={"enable_thinking": ...}.
| Quantization | Hugging Face path |
|---|---|
| BF16 | poolside/Laguna-M.1 |
| FP8 | poolside/Laguna-M.1-FP8 |
| NVFP4 | poolside/Laguna-M.1-NVFP4 |
2. Configuration Tips
- Long-context memory: M.1 is global-attention (no sliding-window), so the 262,144-token KV cache is large. If you hit OOM at full context, lower
--mem-fraction-staticor cap--context-length. - FP8: On Blackwell the recipe adds
--fp8-gemm-backend triton— the compressed-tensors block-FP8 weight scales aren’t UE8M0-packed, so the default DeepGEMM path emits garbage on Blackwell (sm_100); the Triton backend is correct (~19% slower). Temporary workaround pending PR #28662 (which fixes the scales and restores the DeepGEMM fast path). On Hopper (H200) FP8 uses DeepGEMM with no extra flag — pre-warm its multi-session JIT withpython3 -m sglang.compile_deep_gemm --model poolside/Laguna-M.1-FP8to avoid paying it on each restart. - Parsers (
poolside_v1): for agentic / tool-using deployments enable the Reasoning Parser and Tool Call Parser in the Playground above — they emit--reasoning-parser poolside_v1(thinking →reasoning_content) and--tool-call-parser poolside_v1(structuredtool_calls). - Thinking default: thinking is off by default; opt in per request with
extra_body={"chat_template_kwargs": {"enable_thinking": True}}. - Served model id: the server registers the model under whatever you pass to
--model-path, so a client’smodelfield must match it —poolside/Laguna-M.1(BF16) orpoolside/Laguna-M.1-FP8/-NVFP4for the quantized cells. The §3 examples use the BF16 id; swap in the id you launched. - Recommended sampling: poolside benchmarks M.1 at
temperature=1.0,top_k=20with thinking enabled. These are per-request sampling params (not launch flags) — e.g.temperature=1.0, extra_body={"top_k": 20}on the OpenAI client.
3. Advanced Usage
3.1 Reasoning
Launch with--reasoning-parser poolside_v1 (or toggle Reasoning Parser in the Parsers card of the Playground above). Reasoning is opt-in: the Laguna chat template gates it on enable_thinking=True (passed via chat_template_kwargs) — the generic thinking key is ignored. The <think> trace then lands in message.reasoning_content, separate from the final answer in message.content — no client-side tag stripping needed.
Reasoning Example (Python)
Reasoning Example (Python)
Example
Example Output
Example Output
Output
Laguna-M.1’s reasoning traces are long — the model explores and re-verifies an answer
multiple ways. Give it a generous
max_tokens for harder problems (reasoning regularly
exceeds 3k tokens). The trace above is abbreviated; the model emits it in full.3.2 Tool Calling
Launch with--tool-call-parser poolside_v1 (or toggle Tool Call Parser in the Parsers card of the Playground above). The parser converts Laguna’s <tool_call> output into the standard OpenAI tool_calls structure. Tool calling works with reasoning off (enable_thinking=False, the default).
Tool Calling Example (Python)
Tool Calling Example (Python)
Example
Example Output
Example Output
Output
