Deployment

For all methods and hardware platforms, see the official SGLang installation guide. The two paths below match the Python / Docker toggle in the command panel.
Command
pip install --upgrade pip
pip install uv
uv pip install sglang
LFM2.5 support — the dense / MoE / VL model classes and the lfm2 tool-call parser — ships on SGLang main. If your installed release predates it, install from source or use the Docker dev image.
Then run the Python output of the command panel below in that environment.
Every LFM2.5 model runs on a single GPU (TP=1) — pick your hardware + model variant to generate the launch command. One recipe covers all operating points per variant; the commands differ only by the parsers a model needs and, on Blackwell, the attention backend. The lfm2 tool-call parser and each reasoning model’s --reasoning-parser are already part of the verified command.

Panel controls (top of the command box):

  • Python / Docker — bare sglang serve … for an existing SGLang env, or a docker run … sglang serve … wrap against the dev image from the Install SGLang panel above.
  • ⧉ Copy — copies the current command (with whichever framing is active) to your clipboard.
  • $ cURL — a sample request against localhost:30000 to confirm the server is up.
  • ⚙ Env — edits the placeholders (HOST_IP, PORT, HF_TOKEN) the command and cURL share. Persists in localStorage across cookbooks.
  • Verified / Not Verified badge — green when the (hw, variant, quant, strategy, nodes) combo has been run end-to-end on real hardware; yellow when auto-derived from a neighbor and not yet re-checked.
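
The $ cURL check can also be scripted. The snippet below is a minimal sketch, assuming the server exposes the OpenAI-compatible /v1/models route on the default localhost:30000 address mentioned above; the helper names models_url and server_is_up are illustrative and not part of SGLang.

```python
import json
import urllib.request


def models_url(host="localhost", port=30000):
    """Build the model-listing URL (assumed OpenAI-compatible route)."""
    return f"http://{host}:{port}/v1/models"


def server_is_up(host="localhost", port=30000, timeout=5.0):
    """Return True if the endpoint answers with a non-empty model list."""
    try:
        with urllib.request.urlopen(models_url(host, port), timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return False
    return isinstance(payload, dict) and bool(payload.get("data"))
```

Calling server_is_up() after launch returns False until the server has finished loading the model and starts answering requests.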

Playground

The Playground is where you experiment with SGLang features beyond the verified matrix. The Deploy panel above only emits combinations that have been signed off on; the Playground lets you turn on additional knobs on top of whichever cell the Deploy panel is currently showing. The base is read live from your Deploy selection, so only your overrides change.

For LFM2.5 the exposed knob is the TP override (every variant is verified at TP=1; TP=2 is available for experimentation on the larger checkpoints). The reasoning and tool-call parsers are not playground toggles here: they are variant-intrinsic and already baked into each verified command.

Lines highlighted green are added by your overrides; lines with red strikethrough were in the verified base but stripped by an override. When no override differs from the base cell, the playground inherits the base’s Verified badge; any actual change flips it to Not Verified until the new configuration is run end-to-end and submitted back.

Panel controls reuse Python / Docker · ⧉ Copy · $ cURL · ⚙ Env from the Deploy panel, plus one extra:

  • Submit ↗ — opens a pre-filled GitHub issue so you can land your override combo as a new verified cookbook cell. Shown only while the badge says Not Verified; click it once you’ve actually run the command on your hardware and confirmed it works.
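
The green / red highlighting and the badge flip amount to a diff between the verified base and the overridden command. The sketch below illustrates that idea only; it is not the panel's implementation. The argument strings are hypothetical, and --tp is assumed here as the spelling of the TP override flag, so check the command the panel actually emits.

```python
import shlex


def parse_flags(args):
    """Map each --flag to the list of values that follow it."""
    flags, current = {}, None
    for tok in shlex.split(args):
        if tok.startswith("--"):
            current = tok
            flags[current] = []
        elif current is not None:
            flags[current].append(tok)
    return flags


def diff_flags(base, override):
    """Return (added, removed): flags new or changed, and flags dropped or changed."""
    b, o = parse_flags(base), parse_flags(override)
    added = {k: v for k, v in o.items() if b.get(k) != v}
    removed = {k: v for k, v in b.items() if o.get(k) != v}
    return added, removed


def still_verified(base, override):
    """The badge stays Verified only when nothing differs from the base."""
    added, removed = diff_flags(base, override)
    return not added and not removed


# Hypothetical argument strings, for illustration only.
base = "--tool-call-parser lfm2 --tp 1"
override = "--tool-call-parser lfm2 --tp 2"
added, removed = diff_flags(base, override)
```

Here added holds the TP=2 setting (the green line), removed holds the TP=1 setting (the red strikethrough), and still_verified(base, override) is False.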

1. Model Introduction

LFM2.5 is Liquid AI’s family of hybrid models for on-device deployment, built on the LFM2 architecture with extended pre-training and large-scale reinforcement learning, released under the LFM Open License v1.0. The backbone interleaves double-gated LIV (linear input-varying) convolution blocks with a small number of GQA full-attention blocks: the convolution blocks give linear-time, low-memory sequence mixing while the periodic attention blocks preserve associative recall.
Key Features:
  • Hybrid LIV-conv + GQA architecture: the 1.2B / 350M dense models are 16 layers (10 conv + 6 GQA); the 8B-A1B MoE is 24 layers (18 conv + 6 GQA).
  • Pythonic tool calling: function calls are emitted as a Python list between <|tool_call_start|> and <|tool_call_end|> tokens. The lfm2 tool-call parser surfaces these as standard message.tool_calls.
  • Reasoning variants: the 8B-A1B and 1.2B-Thinking checkpoints emit an explicit <think>...</think> chain-of-thought before the answer.
  • Multilingual: up to 10 languages, with dedicated Japanese chat checkpoints.
  • Vision: LFM2.5-VL-1.6B pairs the 1.2B language backbone with a SigLIP2 NaFlex 400M encoder for OCR, document understanding, and multilingual vision; LFM2.5-VL-450M pairs the 350M backbone with a SigLIP2 86M encoder for captioning and object detection at edge sizes.
Available Models:
| Model | Parameters | Context | Role |
| --- | --- | --- | --- |
| LFM2.5-8B-A1B | 8.3B total / 1.5B active (MoE) | 128K | Reasoning-tuned, agentic / tool use |
| LFM2.5-1.2B-Instruct | 1.17B (dense) | 32K | General instruct, RAG, data extraction |
| LFM2.5-1.2B-Thinking | 1.17B (dense) | 32K | Reasoning (always-on chain-of-thought) |
| LFM2.5-350M | 350M (dense) | 32K | Compact instruct, structured output |
| LFM2.5-1.2B-JP-202606 | 1.17B (dense) | 32K | Japanese chat (latest) |
| LFM2.5-1.2B-JP | 1.17B (dense) | 32K | Japanese chat (original) |
| LFM2.5-VL-1.6B | 1.2B LM + SigLIP2 400M | 32K | Vision-language (OCR, docs, multi-image) |
| LFM2.5-VL-450M | 350M LM + SigLIP2 86M | 32K | Compact vision-language (captioning, object detection) |
| LFM2.5-1.2B-Base | 1.17B (dense) | 32K | Pre-trained base (completions only) |
The Deploy panel above covers the seven serving variants. LFM2.5-1.2B-JP (original; launch without --tool-call-parser) and LFM2.5-1.2B-Base (no chat template; use the completions endpoint, see §3.5) launch the same way with the model path swapped.
License: LFM Open License v1.0.
Resources: Liquid AI blog, LFM docs, LFM2 Technical Report (arXiv:2511.23404).
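
As a rough sanity check on the parameter counts listed above, bf16 weight memory is about two bytes per parameter. The sketch below is back-of-the-envelope arithmetic under that assumption; it ignores KV cache, convolution state, activations, and the vision encoders, and the helper name weight_gb is illustrative.

```python
def weight_gb(params_billion, bytes_per_param=2):
    """Approximate weight memory in decimal GB (bf16 = 2 bytes per parameter)."""
    return params_billion * 1e9 * bytes_per_param / 1e9


for name, params in [("8B-A1B", 8.3), ("1.2B dense", 1.17), ("350M dense", 0.35)]:
    print(f"{name}: ~{weight_gb(params):.1f} GB of bf16 weights")
```

For 8B-A1B this lands in the same range as the roughly 16 GB quoted in the hardware requirements tip; all experts of the MoE are resident, so the total parameter count is what matters for memory, not the active count.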

2. Configuration Tips

  • Reasoning parser: LFM2.5 reasoning models wrap their chain-of-thought in <think>...</think> tags. The command generator passes --reasoning-parser qwen3 for 8B-A1B (it emits an explicit opening <think>) and --reasoning-parser qwen3-thinking for 1.2B-Thinking (always-on reasoning). This splits the thinking process into reasoning_content; without it the chain-of-thought stays inline in content.
  • Tool calling: --tool-call-parser lfm2 surfaces LFM2.5’s Pythonic <|tool_call_start|>[...]<|tool_call_end|> calls as standard message.tool_calls. The original 1.2B-JP does not expose tool calling; Base has no chat template (use completions).
  • Attention backend on Blackwell (B200/sm100): SGLang defaults to the trtllm_mha backend on sm100, which is fastest for the dense text models. The 8B-A1B uses a mamba-style state cache that runs on a page-size-1 backend, so the generator picks --attention-backend flashinfer for it. The VL language model also uses that state cache and offers two backends: --attention-backend flashinfer (keeps prefix/radix caching — what the generator emits), or --attention-backend trtllm_mha --disable-radix-cache to run the language model on Blackwell trtllm_mha attention (--disable-radix-cache lifts the page-size-1 requirement, at the cost of prefix caching). Pair either with --mm-attention-backend fa4 for the vision tower.
  • VL vision tower (--mm-attention-backend): on sm100 the trtllm_mha default is fastest for text but applies causal attention to image tokens. For the VL model, pass --mm-attention-backend fa4 on B200/B300 (or fa3 on H100/H200) to restore bidirectional image-token attention and full vision quality.
  • VL multimodal feature transport: the generator launches the VL models with SGLANG_USE_CUDA_IPC_TRANSPORT=1 SGLANG_USE_IPC_POOL_HANDLE_CACHE=1. The first moves the processor→scheduler image-feature handoff onto CUDA IPC instead of serializing tensors between processes; the second ships the pool handle so the scheduler opens it once and caches it, instead of opening a per-item handle on every request. On the image serving workload (1 image @ 720p, measured on VL-1.6B on H100 and B200) this pair is worth roughly 30–50% higher image throughput and 30–40% lower image TTFT than running without them; decode speed (TPOT) is unaffected.
  • VL-450M memory headroom (--mem-fraction-static 0.8): with the default memory fraction, the 450M’s small weights make SGLang size its static KV/mamba pools to nearly the whole GPU, leaving no headroom for image-feature tensors — under sustained concurrent image load the scheduler can crash with a CUDA OOM in the radix-cache free path. The generator caps --mem-fraction-static 0.8 for VL-450M; the pool is still far larger than this model ever needs.
  • Mamba scheduling: LFM2.5 runs on the default no_buffer mamba scheduler strategy — no --mamba-scheduler-strategy flag is needed. The extra_buffer strategy (an overlap-scheduling throughput optimization available for some Gated-DeltaNet hybrids) does not apply to LFM2.5, whose convolution blocks use mamba_chunk_size=1.
  • Hardware requirements: all LFM2.5 models run on a single GPU (TP=1) on either Hopper or Blackwell. The 1.2B / 350M dense models fit in a few GB; the 8B-A1B MoE needs roughly 16 GB for bf16 weights plus KV cache. Multi-GPU tensor parallelism is not required for any variant.
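
The tips above can be collected into a small lookup. This is a sketch of the tips as written in this section, not the command generator's actual logic or output; the function name tip_flags, the short model names (without an organization prefix), and the blackwell boolean are illustrative assumptions. It returns only the variant-specific environment variables and flags, not a full launch command.

```python
REASONING_PARSER = {
    "LFM2.5-8B-A1B": "qwen3",
    "LFM2.5-1.2B-Thinking": "qwen3-thinking",
}
NO_TOOL_PARSER = {"LFM2.5-1.2B-JP", "LFM2.5-1.2B-Base"}
VL_MODELS = {"LFM2.5-VL-1.6B", "LFM2.5-VL-450M"}


def tip_flags(model, blackwell):
    """Return (env, flags) implied by the configuration tips for one variant."""
    env, flags = {}, []
    if model in REASONING_PARSER:
        flags += ["--reasoning-parser", REASONING_PARSER[model]]
    if model not in NO_TOOL_PARSER:
        flags += ["--tool-call-parser", "lfm2"]
    # State-cache models need a page-size-1 backend on Blackwell.
    if blackwell and (model == "LFM2.5-8B-A1B" or model in VL_MODELS):
        flags += ["--attention-backend", "flashinfer"]
    if model in VL_MODELS:
        # Bidirectional image-token attention for the vision tower.
        flags += ["--mm-attention-backend", "fa4" if blackwell else "fa3"]
        env = {
            "SGLANG_USE_CUDA_IPC_TRANSPORT": "1",
            "SGLANG_USE_IPC_POOL_HANDLE_CACHE": "1",
        }
    if model == "LFM2.5-VL-450M":
        flags += ["--mem-fraction-static", "0.8"]
    return env, flags
```

For example, tip_flags("LFM2.5-VL-450M", blackwell=True) yields both environment variables plus the tool-call parser, flashinfer, fa4, and the 0.8 memory cap. Treat the verified command from the Deploy panel as the source of truth when the two disagree.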
Recommended sampling parameters — pass these explicitly on every request. Some LFM2.5 checkpoints do not ship sampling defaults in generation_config.json, so the server will not apply them for you. top_k, min_p, and repetition_penalty are not standard OpenAI chat.completions fields — pass them through extra_body and SGLang forwards them to its sampler. Do not set max_tokens unless you intend to cap output, as it can truncate a response (or a reasoning model’s chain-of-thought) mid-stream.
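| Model | temperature | extra_body (sampler) |
| --- | --- | --- |
| LFM2.5-8B-A1B | 0.2 | top_k=80, repetition_penalty=1.05 |
| LFM2.5-1.2B-Instruct | 0.1 | top_k=50, repetition_penalty=1.05 |
| LFM2.5-1.2B-Thinking | 0.05 | top_k=50, repetition_penalty=1.05 |
| LFM2.5-350M | 0.1 | top_k=50, repetition_penalty=1.05 |
| LFM2.5-1.2B-JP-202606 | 0.1 | top_k=50, repetition_penalty=1.05 |
| LFM2.5-1.2B-JP | 0.3 | |
| LFM2.5-VL-1.6B (text) | 0.1 | min_p=0.15, repetition_penalty=1.05 |
| LFM2.5-VL-450M (text) | 0.1 | min_p=0.15, repetition_penalty=1.05 |
| LFM2.5-1.2B-Base | 0.3 | min_p=0.15, repetition_penalty=1.05 |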

3. Advanced Usage

3.1 Basic Usage

A single client with the recommended sampling presets applied per model (the examples in the following sections reuse this chat helper):
Example
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Non-OpenAI fields (top_k / min_p / repetition_penalty) ride in extra_body.
SAMPLING = {
    "LiquidAI/LFM2.5-8B-A1B":         dict(temperature=0.2,  extra_body={"top_k": 80, "repetition_penalty": 1.05}),
    "LiquidAI/LFM2.5-1.2B-Instruct":  dict(temperature=0.1,  extra_body={"top_k": 50, "repetition_penalty": 1.05}),
    "LiquidAI/LFM2.5-1.2B-Thinking":  dict(temperature=0.05, extra_body={"top_k": 50, "repetition_penalty": 1.05}),
    "LiquidAI/LFM2.5-350M":           dict(temperature=0.1,  extra_body={"top_k": 50, "repetition_penalty": 1.05}),
    "LiquidAI/LFM2.5-1.2B-JP-202606": dict(temperature=0.1,  extra_body={"top_k": 50, "repetition_penalty": 1.05}),
    "LiquidAI/LFM2.5-VL-1.6B":        dict(temperature=0.1,  extra_body={"min_p": 0.15, "repetition_penalty": 1.05}),
    "LiquidAI/LFM2.5-VL-450M":        dict(temperature=0.1,  extra_body={"min_p": 0.15, "repetition_penalty": 1.05}),
}

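def chat(model, messages, **overrides):
    """Send a chat request with the model's recommended sampling preset applied."""
    cfg = SAMPLING[model]
    # Caller-supplied sampler fields win over the preset (dict union needs Python 3.9+).
    body = cfg["extra_body"] | overrides.pop("extra_body", {})
    # Pop temperature so an explicit override does not collide with the preset keyword.
    temperature = overrides.pop("temperature", cfg["temperature"])
    return client.chat.completions.create(
        model=model, messages=messages,
        temperature=temperature, extra_body=body, **overrides,
    )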

resp = chat(
    "LiquidAI/LFM2.5-1.2B-Instruct",
    [{"role": "user", "content": "What is C. elegans? Answer in one sentence."}],
)
print(resp.choices[0].message.content)
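
The helper merges a per-call extra_body into the preset with a dict union, so a caller's value replaces the preset value for the same key and the preset itself is left untouched. A minimal standalone sketch of that precedence follows; merge_sampler is an illustrative name, and the preset values are copied from the 1.2B-Instruct entry above.

```python
# Preset copied from the LFM2.5-1.2B-Instruct entry in SAMPLING.
PRESET = {"top_k": 50, "repetition_penalty": 1.05}


def merge_sampler(preset, override=None):
    """Return a new dict in which override keys replace preset keys."""
    return preset | (override or {})  # dict union, Python 3.9+


merged = merge_sampler(PRESET, {"top_k": 20})
print(merged)
```

The same thing happens when calling the helper with an extra_body keyword, for example extra_body={"top_k": 20}: only top_k changes and repetition_penalty keeps its preset value.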

3.2 Reasoning

The 8B-A1B and 1.2B-Thinking checkpoints emit chain-of-thought as a built-in behavior. The Deploy panel launches them with the matching --reasoning-parser, which separates the thinking process into reasoning_content:
Example
resp = chat(
    "LiquidAI/LFM2.5-8B-A1B",
    [{"role": "user", "content": "If a train travels 60 km/h for 2.5 hours, how far does it go?"}],
)
msg = resp.choices[0].message
print("Reasoning:", msg.reasoning_content)
print("Answer:", msg.content)
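
If the server was launched without a reasoning parser, the chain-of-thought stays inline in content. The sketch below shows one way to split it on the client side; it is not SGLang's parser, and it assumes a single reasoning block at the start of the text. The opening <think> is treated as optional because an always-on reasoning model may only emit the closing tag, which is an assumption here.

```python
def split_reasoning(text):
    """Split '<think>...</think>answer' into (reasoning, answer).

    Returns (None, text) when no closing tag is present.
    """
    head, sep, tail = text.partition("</think>")
    if not sep:
        return None, text.strip()
    # Drop the opening tag if the model emitted one.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


raw = "<think>60 km/h * 2.5 h = 150 km</think>\nThe train travels 150 km."
reasoning, answer = split_reasoning(raw)
print("Reasoning:", reasoning)
print("Answer:", answer)
```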

3.3 Tool Calling

LFM2.5 writes Pythonic tool calls. With --tool-call-parser lfm2 (already part of the launch command) they are surfaced as standard message.tool_calls:
Example
resp = chat(
    "LiquidAI/LFM2.5-1.2B-Instruct",
    [{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }],
)
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
Tool calling is supported on 8B-A1B, 1.2B-Thinking, 1.2B-Instruct, 350M, 1.2B-JP-202606, VL-1.6B, and VL-450M. For the VL models it is text-turn-only — do not combine an image and tools in the same turn.
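
To make the Pythonic format concrete, the sketch below parses a raw completion of the shape described above (a Python list of calls between <|tool_call_start|> and <|tool_call_end|>) into name / arguments pairs. It only illustrates the format and is not the lfm2 parser's implementation; it assumes keyword arguments with literal values, and the sample string is hand-written rather than captured model output.

```python
import ast
import json
import re

TOOL_CALL_RE = re.compile(
    r"<\|tool_call_start\|>(.*?)<\|tool_call_end\|>", re.DOTALL
)


def parse_pythonic_tool_calls(text):
    """Extract calls as dicts with a name and JSON-encoded arguments."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        tree = ast.parse(block.strip(), mode="eval")
        if not isinstance(tree.body, ast.List):
            raise ValueError("expected a Python list of calls")
        for node in tree.body.elts:
            if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
                raise ValueError("expected plain function calls")
            if any(kw.arg is None for kw in node.keywords):
                raise ValueError("**kwargs expansion is not supported")
            # literal_eval accepts AST nodes and rejects non-literal values.
            args = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
            calls.append({"name": node.func.id, "arguments": json.dumps(args)})
    return calls


# Hand-written sample in the documented format.
raw = '<|tool_call_start|>[get_weather(location="Paris")]<|tool_call_end|>'
calls = parse_pythonic_tool_calls(raw)
print(calls)
```

Nothing is executed: the text is parsed into a syntax tree and only literal values are evaluated, which is the reason for using ast rather than eval.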

3.4 Vision Input

The VL models (VL-1.6B and VL-450M) accept images via standard OpenAI multimodal content blocks. Base64 data URIs (data:image/jpeg;base64,...) work in place of a URL:
Example
resp = chat(
    "LiquidAI/LFM2.5-VL-1.6B",
    [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {
                "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}},
            {"type": "text", "text": "What is in this image?"},
        ],
    }],
)
print(resp.choices[0].message.content)
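
For local images, the data URI form mentioned above can be built with the standard library. A minimal sketch follows; the helper names to_data_uri and image_block are illustrative, and the MIME type is passed in by the caller rather than detected.

```python
import base64


def to_data_uri(image_bytes, mime="image/jpeg"):
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_block(image_bytes, mime="image/jpeg"):
    """Build an OpenAI-style image content block from raw bytes."""
    return {"type": "image_url", "image_url": {"url": to_data_uri(image_bytes, mime)}}
```

The resulting block goes in the content list in place of the URL-based one, for example image_block(pathlib.Path("photo.jpg").read_bytes()), where photo.jpg is a placeholder filename.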

3.5 Base Checkpoint

LFM2.5-1.2B-Base has no chat template — use the completions endpoint:
Example
comp = client.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Base",
    prompt="The capital of France is",
    temperature=0.3,
    extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
)
print(comp.choices[0].text)