TTS Model Usage#
This guide uses Fish Speech S2-Pro as an example TTS (text-to-speech) model with SGLang-Omni and the OpenAI-compatible API.
Hugging Face assets below are fetched with the `hf download` command from `huggingface_hub` (the older `huggingface-cli download` name is deprecated). For datasets you must pass `--repo-type dataset`; the default repo type is `model`, so dataset downloads fail with a 404 without it.
Prerequisites#
```shell
docker pull frankleeeee/sglang-omni:dev
docker run -it --shm-size 32g --gpus all frankleeeee/sglang-omni:dev /bin/zsh
```

Inside the container, set up the environment and download the model weights:

```shell
git clone https://github.com/sgl-project/sglang-omni.git
cd sglang-omni
uv venv .venv -p 3.12 && source .venv/bin/activate
uv pip install -e ".[s2pro]"
hf download fishaudio/s2-pro
```
Launch the Server#
```shell
sgl-omni serve \
  --model-path fishaudio/s2-pro \
  --config examples/configs/s2pro_tts.yaml \
  --port 8000
```
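Startup can take a while. Before sending requests, you can wait for the port from the launch command above to accept connections; this is a generic sketch that polls the TCP socket and does not assume any particular health endpoint:

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 120.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return True  # server is accepting connections
        except OSError:
            time.sleep(1)  # not up yet; retry
    return False
```

For example, `wait_for_port("localhost", 8000)` returns `True` once the server from the command above starts accepting connections.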
Use Curl#
Generate speech from text without any reference audio:
```shell
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello, how are you?"}' \
  --output output.wav
```
Note that without reference audio, the generated voice will sound robotic. For natural-sounding results, use Voice Cloning with a reference audio clip.
Voice Cloning#
The curl examples below use a fixed clip from seed-tts-eval-mini:
```shell
LOCAL_DIR="./seed-tts-eval-mini"
hf download --repo-type dataset --local-dir "$LOCAL_DIR" \
  zhaochenyang20/seed-tts-eval-mini \
  en/prompt-wavs/common_voice_en_10119832.wav
REF_WAV="$(cd "$LOCAL_DIR/en/prompt-wavs" && pwd)/common_voice_en_10119832.wav"
```
Non-streaming request
```shell
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Get the trust fund to the bank early.", "references": [{"audio_path": "'"$REF_WAV"'", "text": "We asked over twenty different people, and they all said it was his."}]}' \
  --output output.wav
```
The `references` field is a list of objects, each carrying an `audio_path` and the `text` transcript of that audio.
Streaming
Enable streaming to receive audio chunks in real time via Server-Sent Events (SSE). Set "stream": true:
```shell
curl -N -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Get the trust fund to the bank early.", "references": [{"audio_path": "'"$REF_WAV"'", "text": "We asked over twenty different people, and they all said it was his."}], "stream": true}'
```
The server returns a stream of SSE events. Each event contains an audio.speech.chunk object with a base64-encoded audio chunk. The stream ends with data: [DONE].
Use Python#
Basic TTS#
```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={"input": "Hello, how are you?"},
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)
```
Voice Cloning#
Use the same wav as in the curl section (after hf download).
```python
from pathlib import Path

ref_path = (
    Path("seed-tts-eval-mini") / "en" / "prompt-wavs" / "common_voice_en_10119832.wav"
).resolve()

SPEECH_INPUT = "Get the trust fund to the bank early."
REFERENCE_TEXT = "We asked over twenty different people, and they all said it was his."
```
Non-streaming Request
```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": SPEECH_INPUT,
        "references": [{"audio_path": str(ref_path), "text": REFERENCE_TEXT}],
    },
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)
```
Streaming Request
```python
import base64, io, json, wave

import requests

payload = {
    "input": SPEECH_INPUT,
    "references": [{"audio_path": str(ref_path), "text": REFERENCE_TEXT}],
    "stream": True,
    "response_format": "wav",
}

chunks = []
fmt = None
with requests.post(
    "http://localhost:8000/v1/audio/speech",
    json=payload,
    stream=True,
    timeout=600,
) as stream:
    stream.raise_for_status()
    for line in stream.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data:"):].lstrip()
        if data == "[DONE]":
            break
        b64 = (json.loads(data).get("audio") or {}).get("data")
        if not b64:
            continue
        with wave.open(io.BytesIO(base64.b64decode(b64)), "rb") as w:
            if fmt is None:
                fmt = w.getnchannels(), w.getsampwidth(), w.getframerate()
            chunks.append(w.readframes(w.getnframes()))

assert fmt
nc, sw, fr = fmt
with wave.open("output_stream.wav", "wb") as w:
    w.setnchannels(nc)
    w.setsampwidth(sw)
    w.setframerate(fr)
    w.writeframes(b"".join(chunks))
```
Request Parameters#
The table below lists all parameters accepted by the /v1/audio/speech endpoint.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `input` | string | (required) | Text to synthesize |
| `voice` | string |  | Voice identifier |
| `response_format` | string |  | Output audio format |
| `speed` | float |  | Playback speed multiplier |
| `stream` | bool |  | Enable streaming via SSE |
| `references` | list |  | Reference audio for voice cloning; each item has `audio_path` and `text` |
| `max_tokens` | int |  | Maximum number of generated tokens |
| `temperature` | float |  | Sampling temperature |
| `top_p` | float |  | Top-p sampling |
| `top_k` | int |  | Top-k sampling |
| `repetition_penalty` | float |  | Repetition penalty |
| `seed` | int |  | Random seed for reproducibility |
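Putting several of these parameters together, a request body might look like the following. Apart from `input`, `references`, `stream`, and `response_format`, which appear elsewhere in this guide, the field names follow common OpenAI-style conventions and are assumptions to check against the server's schema:

```python
import json

# Illustrative request body; values are examples, not recommended settings.
payload = {
    "input": "Hello, how are you?",
    "response_format": "wav",  # shown in the streaming example above
    "speed": 1.0,              # assumed name: playback speed multiplier
    "temperature": 0.7,        # assumed name: sampling temperature
    "top_p": 0.9,              # assumed name: nucleus sampling
    "seed": 42,                # assumed name: random seed for reproducibility
}
print(json.dumps(payload, indent=2))
```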
Interactive Playground#
SGLang-Omni ships with a Gradio-based playground for interactive TTS experimentation:
```shell
./playground/tts/start.sh
```

A demo video is available here. We highly recommend the playground, since audio data is hard to interact with from the CLI.