TTS Model Usage#

This guide uses Fish Speech S2-Pro as an example TTS (text-to-speech) model with SGLang-Omni and the OpenAI-compatible API.

The Hugging Face assets below are fetched with the hf download command from huggingface-hub (the old huggingface-cli download name is deprecated). For datasets, you must pass --repo-type dataset; the default repo type is model, so dataset repos return a 404 without it.

Prerequisites#

docker pull frankleeeee/sglang-omni:dev
docker run -it --shm-size 32g --gpus all frankleeeee/sglang-omni:dev /bin/zsh
git clone https://github.com/sgl-project/sglang-omni.git
cd sglang-omni
uv venv .venv -p 3.12 && source .venv/bin/activate
uv pip install -e ".[s2pro]"
hf download fishaudio/s2-pro

Launch the Server#

sgl-omni serve \
  --model-path fishaudio/s2-pro \
  --config examples/configs/s2pro_tts.yaml \
  --port 8000

Use Curl#

Generate speech from text without any reference audio:

curl -X POST http://localhost:8000/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{"input": "Hello, how are you?"}' \
    --output output.wav

Note that without reference audio, the generated voice will sound robotic. For natural-sounding results, use Voice Cloning with a reference audio clip.

Voice Cloning#

The curl examples below use a fixed clip from seed-tts-eval-mini:

LOCAL_DIR="./seed-tts-eval-mini"
hf download --repo-type dataset --local-dir "$LOCAL_DIR" \
  zhaochenyang20/seed-tts-eval-mini \
  en/prompt-wavs/common_voice_en_10119832.wav
REF_WAV="$(cd "$LOCAL_DIR/en/prompt-wavs" && pwd)/common_voice_en_10119832.wav"
1. Non-streaming request

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Get the trust fund to the bank early.", "references": [{"audio_path": "'"$REF_WAV"'", "text": "We asked over twenty different people, and they all said it was his."}]}' \
  --output output.wav

The references field lists audio_path and text (transcript of that audio).

2. Streaming request

Enable streaming to receive audio chunks in real time via Server-Sent Events (SSE). Set "stream": true:

curl -N -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Get the trust fund to the bank early.", "references": [{"audio_path": "'"$REF_WAV"'", "text": "We asked over twenty different people, and they all said it was his."}], "stream": true}'

The server returns a stream of SSE events. Each event contains an audio.speech.chunk object with a base64-encoded audio chunk. The stream ends with data: [DONE].
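Concretely, combining the event type above with the fields the Python streaming client in this guide reads, a stream looks roughly like the following (base64 payloads abbreviated; the exact field set may vary by version):

```text
data: {"object": "audio.speech.chunk", "audio": {"data": "UklGRi..."}}

data: {"object": "audio.speech.chunk", "audio": {"data": "AAAA..."}}

data: [DONE]
```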

Use Python#

Basic TTS#

import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={"input": "Hello, how are you?"},
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)

Voice Cloning#

Use the same wav as in the curl section (after hf download).

from pathlib import Path

ref_path = (
    Path("seed-tts-eval-mini") / "en" / "prompt-wavs" / "common_voice_en_10119832.wav"
).resolve()

SPEECH_INPUT = "Get the trust fund to the bank early."
REFERENCE_TEXT = "We asked over twenty different people, and they all said it was his."
1. Non-streaming request

import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": SPEECH_INPUT,
        "references": [{"audio_path": str(ref_path), "text": REFERENCE_TEXT}],
    },
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)
2. Streaming request

import base64, io, json, wave

import requests

payload = {
    "input": SPEECH_INPUT,
    "references": [{"audio_path": str(ref_path), "text": REFERENCE_TEXT}],
    "stream": True,
    "response_format": "wav",
}

chunks = []
fmt = None
with requests.post(
    "http://localhost:8000/v1/audio/speech",
    json=payload,
    stream=True,
    timeout=600,
) as stream:
    stream.raise_for_status()
    for line in stream.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data:"):].lstrip()
        if data == "[DONE]":
            break
        b64 = (json.loads(data).get("audio") or {}).get("data")
        if not b64:
            continue
        with wave.open(io.BytesIO(base64.b64decode(b64)), "rb") as w:
            if fmt is None:
                fmt = w.getnchannels(), w.getsampwidth(), w.getframerate()
            chunks.append(w.readframes(w.getnframes()))

assert fmt
nc, sw, fr = fmt
with wave.open("output_stream.wav", "wb") as w:
    w.setnchannels(nc)
    w.setsampwidth(sw)
    w.setframerate(fr)
    w.writeframes(b"".join(chunks))

Request Parameters#

The table below lists all parameters accepted by the /v1/audio/speech endpoint.

| Parameter | Type | Default | Description |
|---|---|---|---|
| input | string | (required) | Text to synthesize |
| voice | string | "default" | Voice identifier |
| response_format | string | "wav" | Output audio format |
| speed | float | 1.0 | Playback speed multiplier |
| stream | bool | false | Enable streaming via SSE |
| references | list | null | Reference audio for voice cloning; each item has audio_path and text |
| max_new_tokens | int | null | Maximum number of generated tokens |
| temperature | float | null | Sampling temperature |
| top_p | float | null | Top-p sampling |
| top_k | int | null | Top-k sampling |
| repetition_penalty | float | null | Repetition penalty |
| seed | int | null | Random seed for reproducibility |
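As a quick illustration, here is a request body that sets several of the optional fields; the parameter names come from the table, while the specific values are arbitrary examples (omitted fields fall back to the defaults above):

```python
import json

# Illustrative payload for /v1/audio/speech. Only "input" is required;
# the sampling values (temperature, top_p, seed) are made-up examples.
payload = {
    "input": "Hello, how are you?",
    "voice": "default",        # default voice identifier
    "response_format": "wav",  # default output format
    "speed": 1.0,              # playback speed multiplier
    "stream": False,           # set True for SSE streaming
    "temperature": 0.7,
    "top_p": 0.9,
    "seed": 42,                # for reproducible sampling
}

# Send it exactly like the earlier examples:
# requests.post("http://localhost:8000/v1/audio/speech", json=payload)
print(json.dumps(payload, indent=2))
```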

Interactive Playground#

SGLang-Omni ships with a Gradio-based playground for interactive TTS experimentation:

./playground/tts/start.sh

A demo video is available here. We highly recommend the playground, since audio data is hard to interact with from the CLI.