TTS Model Usage#
This guide uses Fish Speech S2-Pro as an example TTS (text-to-speech) model with SGLang-Omni and the OpenAI-compatible API.
Hugging Face assets below are fetched with the `hf download` command from `huggingface_hub` (the older `huggingface-cli download` name is deprecated). For datasets you must pass `--repo-type dataset`; the default repo type is `model`, so dataset downloads fail with a 404 without it.
Prerequisites#
```shell
docker pull frankleeeee/sglang-omni:dev
docker run -it --shm-size 32g --gpus all frankleeeee/sglang-omni:dev /bin/zsh
```

Inside the container, set up the environment and download the model weights:

```shell
git clone https://github.com/sgl-project/sglang-omni.git
cd sglang-omni
uv venv .venv -p 3.12 && source .venv/bin/activate
uv pip install -e ".[s2pro]"
hf download fishaudio/s2-pro
```
Launch the Server#
```shell
sgl-omni serve \
  --model-path fishaudio/s2-pro \
  --config examples/configs/s2pro_tts.yaml \
  --port 8000
```
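Startup can take a while. Before sending requests, you can wait for the port from the launch command above to accept connections; this is a generic sketch that polls the TCP socket and does not assume any particular health endpoint:

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 120.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return True  # server is accepting connections
        except OSError:
            time.sleep(1)  # not up yet; retry
    return False
```

For example, `wait_for_port("localhost", 8000)` returns `True` once the server from the command above starts accepting connections.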
Use Curl#
Generate speech from text without any reference audio:
```shell
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello, how are you?"}' \
  --output output.wav
```
Note that without reference audio, the generated voice will sound robotic. For natural-sounding results, use Voice Cloning with a reference audio clip.
Voice Cloning#
The curl examples below use a fixed clip from seed-tts-eval-mini:
```shell
LOCAL_DIR="./seed-tts-eval-mini"
hf download --repo-type dataset --local-dir "$LOCAL_DIR" \
  zhaochenyang20/seed-tts-eval-mini \
  en/prompt-wavs/common_voice_en_10119832.wav
REF_WAV="$(cd "$LOCAL_DIR/en/prompt-wavs" && pwd)/common_voice_en_10119832.wav"
```
Non-streaming request
```shell
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Get the trust fund to the bank early.", "references": [{"audio_path": "'"$REF_WAV"'", "text": "We asked over twenty different people, and they all said it was his."}]}' \
  --output output.wav
```
The `references` field is a list of objects, each carrying an `audio_path` and the `text` transcript of that audio.
Streaming
Enable streaming to receive audio chunks in real time via Server-Sent Events (SSE). Set "stream": true:
```shell
curl -N -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Get the trust fund to the bank early.", "references": [{"audio_path": "'"$REF_WAV"'", "text": "We asked over twenty different people, and they all said it was his."}], "stream": true}'
```
The server returns a stream of SSE events. Each event contains an audio.speech.chunk object with a base64-encoded audio chunk. The stream ends with data: [DONE].
Use Python#
Basic TTS#
```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={"input": "Hello, how are you?"},
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)
```
Voice Cloning#
Use the same wav as in the curl section (after hf download).
```python
from pathlib import Path

ref_path = (
    Path("seed-tts-eval-mini") / "en" / "prompt-wavs" / "common_voice_en_10119832.wav"
).resolve()

SPEECH_INPUT = "Get the trust fund to the bank early."
REFERENCE_TEXT = "We asked over twenty different people, and they all said it was his."
```
Non-streaming Request
```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "input": SPEECH_INPUT,
        "references": [{"audio_path": str(ref_path), "text": REFERENCE_TEXT}],
    },
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)
```
Streaming Request
```python
import base64, io, json, wave

import requests

payload = {
    "input": SPEECH_INPUT,
    "references": [{"audio_path": str(ref_path), "text": REFERENCE_TEXT}],
    "stream": True,
    "response_format": "wav",
}

chunks = []
fmt = None
with requests.post(
    "http://localhost:8000/v1/audio/speech",
    json=payload,
    stream=True,
    timeout=600,
) as stream:
    stream.raise_for_status()
    for line in stream.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data:"):].lstrip()
        if data == "[DONE]":
            break
        b64 = (json.loads(data).get("audio") or {}).get("data")
        if not b64:
            continue
        with wave.open(io.BytesIO(base64.b64decode(b64)), "rb") as w:
            if fmt is None:
                fmt = w.getnchannels(), w.getsampwidth(), w.getframerate()
            chunks.append(w.readframes(w.getnframes()))

assert fmt
nc, sw, fr = fmt
with wave.open("output_stream.wav", "wb") as w:
    w.setnchannels(nc)
    w.setsampwidth(sw)
    w.setframerate(fr)
    w.writeframes(b"".join(chunks))
```
Request Parameters#
The table below lists all parameters accepted by the /v1/audio/speech endpoint.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `input` | string | (required) | Text to synthesize |
| `voice` | string |  | Voice identifier |
| `response_format` | string |  | Output audio format |
| `speed` | float |  | Playback speed multiplier |
| `stream` | bool |  | Enable streaming via SSE |
| `references` | list |  | Reference audio for voice cloning; each item has `audio_path` and `text` |
| `max_tokens` | int |  | Maximum number of generated tokens |
| `temperature` | float |  | Sampling temperature |
| `top_p` | float |  | Top-p sampling |
| `top_k` | int |  | Top-k sampling |
| `repetition_penalty` | float |  | Repetition penalty |
| `seed` | int |  | Random seed for reproducibility |
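Putting several of these parameters together, a request body might look like the following. Apart from `input`, `references`, `stream`, and `response_format`, which appear elsewhere in this guide, the field names follow common OpenAI-style conventions and are assumptions to check against the server's schema:

```python
import json

# Illustrative request body; values are examples, not recommended settings.
payload = {
    "input": "Hello, how are you?",
    "response_format": "wav",  # shown in the streaming example above
    "speed": 1.0,              # assumed name: playback speed multiplier
    "temperature": 0.7,        # assumed name: sampling temperature
    "top_p": 0.9,              # assumed name: nucleus sampling
    "seed": 42,                # assumed name: random seed for reproducibility
}
print(json.dumps(payload, indent=2))
```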
Interactive Playground#
SGLang-Omni ships with a Gradio-based playground for interactive TTS experimentation:
```shell
./playground/tts/start.sh
```

A demo video is available here. We highly recommend the playground, since audio data is hard to interact with from the CLI.