Deployment
Install SGLang
Install SGLang
For all methods and hardware platforms, see the official SGLang installation guide. The two paths below match the Python / Docker toggle in the command panel.Then run the Python output of the command panel below in that environment.
- Python (pip / uv)
- Docker
Command
LFM2.5 support — the dense / MoE / VL model classes and the
lfm2 tool-call parser — ships on SGLang main. If your installed release predates it, install from source or use the Docker dev image.lfm2 tool-call parser and each reasoning model’s --reasoning-parser are already part of the verified command.
Panel controls (top of the command box):
- Python / Docker — bare
sglang serve …for an existing SGLang env, or adocker run … sglang serve …wrap against the dev image from the Install SGLang panel above. - ⧉ Copy — copies the current command (with whichever framing is active) to your clipboard.
- $ cURL — a sample request against
localhost:30000to confirm the server is up. - ⚙ Env — edits the placeholders (
HOST_IP,PORT,HF_TOKEN) the command and cURL share. Persists in localStorage across cookbooks. - Verified / Not Verified badge — green when the
(hw, variant, quant, strategy, nodes)combo has been run end-to-end on real hardware; yellow when auto-derived from a neighbor and not yet re-checked.
Playground
The Playground is where you experiment with SGLang features beyond the verified matrix. The Deploy panel above only emits combinations that have been signed off on; the Playground lets you turn on additional knobs on top of whichever cell the Deploy panel is currently showing. The base is read live from your Deploy selection — only your overrides change. For LFM2.5 the exposed knob is the TP override (every variant is verified at TP=1; TP=2 is available for experimentation on the larger checkpoints). The reasoning and tool-call parsers are not playground toggles here — they are variant-intrinsic and already baked into each verified command. Lines highlighted green are added by your overrides; lines with red strikethrough were in the verified base but stripped by an override. When no override differs from the base cell, the playground inherits the base’s Verified badge; any actual change flips it to Not Verified until the new configuration is run end-to-end and submitted back.Panel controls reuse Python / Docker · ⧉ Copy · $ cURL · ⚙ Env from the Deploy panel, plus one extra:
- Submit ↗ — opens a pre-filled GitHub issue so you can land your override combo as a new verified cookbook cell. Shown only while the badge says Not Verified; click it once you’ve actually run the command on your hardware and confirmed it works.
1. Model Introduction
LFM2.5 is Liquid AI’s family of hybrid models for on-device deployment, built on the LFM2 architecture with extended pre-training and large-scale reinforcement learning, released under the LFM Open License v1.0. The backbone interleaves double-gated LIV (linear input-varying) convolution blocks with a small number of GQA full-attention blocks: the convolution blocks give linear-time, low-memory sequence mixing while the periodic attention blocks preserve associative recall. Key Features:- Hybrid LIV-conv + GQA architecture: the 1.2B / 350M dense models are 16 layers (10 conv + 6 GQA); the 8B-A1B MoE is 24 layers (18 conv + 6 GQA).
- Pythonic tool calling: function calls are emitted as a Python list between
<|tool_call_start|>and<|tool_call_end|>tokens. Thelfm2tool-call parser surfaces these as standardmessage.tool_calls. - Reasoning variants: the 8B-A1B and 1.2B-Thinking checkpoints emit an explicit
<think>...</think>chain-of-thought before the answer. - Multilingual: up to 10 languages, with dedicated Japanese chat checkpoints.
- Vision: LFM2.5-VL-1.6B pairs the 1.2B language backbone with a SigLIP2 NaFlex 400M encoder for OCR, document understanding, and multilingual vision; LFM2.5-VL-450M pairs the 350M backbone with a SigLIP2 86M encoder for captioning and object detection at edge sizes.
| Model | Parameters | Context | Role |
|---|---|---|---|
| LFM2.5-8B-A1B | 8.3B total / 1.5B active (MoE) | 128K | Reasoning-tuned, agentic / tool use |
| LFM2.5-1.2B-Instruct | 1.17B (dense) | 32K | General instruct, RAG, data extraction |
| LFM2.5-1.2B-Thinking | 1.17B (dense) | 32K | Reasoning (always-on chain-of-thought) |
| LFM2.5-350M | 350M (dense) | 32K | Compact instruct, structured output |
| LFM2.5-1.2B-JP-202606 | 1.17B (dense) | 32K | Japanese chat (latest) |
| LFM2.5-1.2B-JP | 1.17B (dense) | 32K | Japanese chat (original) |
| LFM2.5-VL-1.6B | 1.2B LM + SigLIP2 400M | 32K | Vision-language (OCR, docs, multi-image) |
| LFM2.5-VL-450M | 350M LM + SigLIP2 86M | 32K | Compact vision-language (captioning, object detection) |
| LFM2.5-1.2B-Base | 1.17B (dense) | 32K | Pre-trained base (completions only) |
--tool-call-parser) and LFM2.5-1.2B-Base (no chat template — use the completions endpoint, see §3.5) launch the same way with the model path swapped.
License: LFM Open License v1.0.
Resources: Liquid AI blog, LFM docs, LFM2 Technical Report (arXiv:2511.23404).
2. Configuration Tips
- Reasoning parser: LFM2.5 reasoning models wrap their chain-of-thought in
<think>...</think>tags. The command generator passes--reasoning-parser qwen3for 8B-A1B (it emits an explicit opening<think>) and--reasoning-parser qwen3-thinkingfor 1.2B-Thinking (always-on reasoning). This splits the thinking process intoreasoning_content; without it the chain-of-thought stays inline incontent. - Tool calling:
--tool-call-parser lfm2surfaces LFM2.5’s Pythonic<|tool_call_start|>[...]<|tool_call_end|>calls as standardmessage.tool_calls. The original 1.2B-JP does not expose tool calling; Base has no chat template (use completions). - Attention backend on Blackwell (B200/sm100): SGLang defaults to the
trtllm_mhabackend on sm100, which is fastest for the dense text models. The 8B-A1B uses a mamba-style state cache that runs on a page-size-1 backend, so the generator picks--attention-backend flashinferfor it. The VL language model also uses that state cache and offers two backends:--attention-backend flashinfer(keeps prefix/radix caching — what the generator emits), or--attention-backend trtllm_mha --disable-radix-cacheto run the language model on Blackwelltrtllm_mhaattention (--disable-radix-cachelifts the page-size-1 requirement, at the cost of prefix caching). Pair either with--mm-attention-backend fa4for the vision tower. - VL vision tower (
--mm-attention-backend): on sm100 thetrtllm_mhadefault is fastest for text but applies causal attention to image tokens. For the VL model, pass--mm-attention-backend fa4on B200/B300 (orfa3on H100/H200) to restore bidirectional image-token attention and full vision quality. - VL multimodal feature transport: the generator launches the VL models with
SGLANG_USE_CUDA_IPC_TRANSPORT=1 SGLANG_USE_IPC_POOL_HANDLE_CACHE=1. The first moves the processor→scheduler image-feature handoff onto CUDA IPC instead of serializing tensors between processes; the second ships the pool handle so the scheduler opens it once and caches it, instead of opening a per-item handle on every request. On the image serving workload (1 image @ 720p, measured on VL-1.6B on H100 and B200) this pair is worth roughly 30–50% higher image throughput and 30–40% lower image TTFT vs running without them (measured on VL-1.6B, H100 and B200); decode speed (TPOT) is unaffected. - VL-450M memory headroom (
--mem-fraction-static 0.8): with the default memory fraction, the 450M’s small weights make SGLang size its static KV/mamba pools to nearly the whole GPU, leaving no headroom for image-feature tensors — under sustained concurrent image load the scheduler can crash with a CUDA OOM in the radix-cache free path. The generator caps--mem-fraction-static 0.8for VL-450M; the pool is still far larger than this model ever needs. - Mamba scheduling: LFM2.5 runs on the default
no_buffermamba scheduler strategy — no--mamba-scheduler-strategyflag is needed. Theextra_bufferstrategy (an overlap-scheduling throughput optimization available for some Gated-DeltaNet hybrids) does not apply to LFM2.5, whose convolution blocks usemamba_chunk_size=1. - Hardware requirements: all LFM2.5 models run on a single GPU (TP=1) on either Hopper or Blackwell. The 1.2B / 350M dense models fit in a few GB; the 8B-A1B MoE needs roughly 16 GB for bf16 weights plus KV cache. Multi-GPU tensor parallelism is not required for any variant.
generation_config.json, so the server will not apply them for you. top_k, min_p, and repetition_penalty are not standard OpenAI chat.completions fields — pass them through extra_body and SGLang forwards them to its sampler. Do not set max_tokens unless you intend to cap output, as it can truncate a response (or a reasoning model’s chain-of-thought) mid-stream.
| Model | temperature | extra_body (sampler) |
|---|---|---|
| LFM2.5-8B-A1B | 0.2 | |
| LFM2.5-1.2B-Instruct | 0.1 | |
| LFM2.5-1.2B-Thinking | 0.05 | |
| LFM2.5-350M | 0.1 | |
| LFM2.5-1.2B-JP-202606 | 0.1 | |
| LFM2.5-1.2B-JP | 0.3 | |
| LFM2.5-VL-1.6B (text) | 0.1 | |
| LFM2.5-VL-450M (text) | 0.1 | |
| LFM2.5-1.2B-Base | 0.3 |
3. Advanced Usage
3.1 Basic Usage
A single client with the recommended sampling presets applied per model (the examples in the following sections reuse thischat helper):
Example
3.2 Reasoning
The 8B-A1B and 1.2B-Thinking checkpoints emit chain-of-thought as a built-in behavior. The Deploy panel launches them with the matching--reasoning-parser, which separates the thinking process into reasoning_content:
Example
3.3 Tool Calling
LFM2.5 writes Pythonic tool calls. With--tool-call-parser lfm2 (already part of the launch command) they are surfaced as standard message.tool_calls:
Example
3.4 Vision Input
The VL models (VL-1.6B and VL-450M) accept images via standard OpenAI multimodal content blocks. Base64 data URIs (data:image/jpeg;base64,...) work in place of a URL:
Example
3.5 Base Checkpoint
LFM2.5-1.2B-Base has no chat template — use the completions endpoint:Example
