Model Capabilities
Text to Speech (Beta)
Convert text into spoken audio with a single API call. The xAI Text to Speech API produces natural, expressive speech with support for multiple voices, inline speech tags for fine-grained control over delivery, and a range of output formats - from high-fidelity MP3 to telephony-optimized μ-law.
Endpoint: POST https://api.x.ai/v1/tts
Beta: The Text to Speech API is currently in beta. Pricing and rate limits are subject to change when the API becomes generally available. See current pricing.
No API key needed to get started. Try the playground to hear every voice, experiment with speech tags, and generate audio right in your browser.
Quick Start
Generate speech from text in three lines:
import os
import requests
response = requests.post(
"https://api.x.ai/v1/tts",
headers={
"Authorization": f"Bearer {os.environ['XAI_API_KEY']}",
"Content-Type": "application/json",
},
json={
"text": "Hello! Welcome to the xAI Text to Speech API.",
"voice_id": "eve",
},
)
response.raise_for_status()
with open("hello.mp3", "wb") as f:
f.write(response.content)
print(f"Saved {len(response.content):,} bytes to hello.mp3")
The response body contains raw audio bytes. Save directly to a file or pipe to an audio player.
Don't have an API key yet? Create one in the console - it only takes a few seconds.
Request Body
| Parameter | Type | Required | Description |
|---|---|---|---|
text | string | ✓ | The text to convert to speech. Maximum 4,096 characters. Supports speech tags. |
voice_id | string | Voice to use for synthesis. Defaults to eve. See Voices. | |
output_format | object | Output format configuration. Defaults to MP3 at 24 kHz / 128 kbps. See Output Formats. |
Example with all options
JSON
{
"text": "Hello! This is a high-fidelity text to speech example.",
"voice_id": "ara",
"output_format": {
"codec": "mp3",
"sample_rate": 44100,
"bit_rate": 192000
}
}
Voices
Five voices are available, each with a distinct personality. Listen to samples and choose the best fit for your use case:
| Voice | Tone | Description | Sample |
|---|---|---|---|
eve | Energetic, upbeat | Default voice - engaging and enthusiastic | |
ara | Warm, friendly | Balanced and conversational | |
rex | Confident, clear | Professional and articulate - ideal for business | |
sal | Smooth, balanced | Versatile voice for a wide range of contexts | |
leo | Authoritative, strong | Commanding and decisive - great for instructional content |
Voice IDs are case-insensitive - eve, Eve, and EVE all work. Preview all voices in the playground →
Choosing the right voice
eve- Great default for demos, announcements, and upbeat contentara- Ideal for conversational interfaces, customer support, and warm narrationrex- Best for business presentations, corporate communications, and tutorialssal- Versatile choice for balanced delivery across different content typesleo- Perfect for authoritative narration, instructions, and educational content
You can also list voices programmatically with the List voices endpoint:
import os
import requests
response = requests.get(
"https://api.x.ai/v1/tts/voices",
headers={"Authorization": f"Bearer {os.environ['XAI_API_KEY']}"},
)
for voice in response.json()["voices"]:
print(f"{voice['voice_id']:5s} {voice['name']}")
Supported Languages
The TTS API supports the following languages. The model automatically detects the language of the input text, so no explicit language parameter is needed.
| Language Code | Language |
|---|---|
en-US | English (United States) |
en-GB | English (United Kingdom) |
es-ES | Spanish (Spain) |
es-MX | Spanish (Mexico) |
fr-FR | French (France) |
it-IT | Italian |
ko-KR | Korean |
pl-PL | Polish |
tr-TR | Turkish |
ar-AE | Arabic (United Arab Emirates) |
bn-BD | Bengali (Bangladesh) |
el-GR | Greek |
hu-HU | Hungarian |
ta-IN | Tamil (India) |
te-IN | Telugu (India) |
ur-PK | Urdu (Pakistan) |
vi-VN | Vietnamese |
zh-TW | Chinese (Taiwan) |
Speech Tags
Add inline speech tags to your text for expressive delivery. There are two types of tags:
- Inline tags
[tag]— placed at a specific point in the text to produce a vocal expression (e.g. a laugh or pause) - Wrapping tags
<tag>text</tag>— wrap a section of text to change how it is delivered (e.g. whispering, singing)
Inline Tags
Insert these where the expression should occur:
| Category | Tags |
|---|---|
| Pauses | [pause] [long-pause] [hum-tune] |
| Laughter & crying | [laugh] [chuckle] [giggle] [cry] |
| Mouth sounds | [tsk] [tongue-click] [lip-smack] |
| Breathing | [breath] [inhale] [exhale] [sigh] |
Wrapping Tags
Wrap text to change delivery style. Use an opening tag and a matching closing tag:
| Category | Tags |
|---|---|
| Volume & intensity | <soft> <whisper> <loud> <build-intensity> <decrease-intensity> |
| Pitch & speed | <higher-pitch> <lower-pitch> <slow> <fast> |
| Vocal style | <sing-song> <singing> <laugh-speak> <emphasis> |
Examples
import os
import requests
# Inline tags
response = requests.post(
"https://api.x.ai/v1/tts",
headers={
"Authorization": f"Bearer {os.environ['XAI_API_KEY']}",
"Content-Type": "application/json",
},
json={
"text": "So I walked in and [pause] there it was. [laugh] I honestly could not believe it!",
"voice_id": "eve",
},
)
response.raise_for_status()
with open("expressive.mp3", "wb") as f:
f.write(response.content)
# Wrapping tags
response = requests.post(
"https://api.x.ai/v1/tts",
headers={
"Authorization": f"Bearer {os.environ['XAI_API_KEY']}",
"Content-Type": "application/json",
},
json={
"text": "I need to tell you something. <whisper>It is a secret.</whisper> Pretty cool, right?",
"voice_id": "eve",
},
)
response.raise_for_status()
with open("whisper.mp3", "wb") as f:
f.write(response.content)
Tips for speech tags:
- Place inline tags where the expression would naturally occur in conversation
- Combine tags with punctuation —
"Really? [laugh] That's incredible!"produces more natural results than stacking tags - Use
[pause]or[long-pause]to add dramatic timing or let a thought land - Wrapping tags work best around complete phrases —
<whisper>It is a secret.</whisper>reads more naturally than wrapping individual words - Combine styles for effect —
<slow><soft>Goodnight, sleep well.</soft></slow>
Output Formats
Control the audio codec, sample rate, and bit rate with the output_format object. When omitted, the default is MP3 at 24 kHz / 128 kbps.
Codecs
| Codec | Content-Type | Best for |
|---|---|---|
mp3 | audio/mpeg | General use - wide compatibility, good compression |
wav | audio/wav | Lossless audio - editing, post-production |
pcm | audio/pcm | Raw audio - real-time processing pipelines |
mulaw | audio/basic | Telephony (G.711 μ-law) |
alaw | audio/alaw | Telephony (G.711 A-law) |
Sample Rates
| Rate | Description |
|---|---|
8000 | Narrowband - telephony |
16000 | Wideband - speech recognition |
22050 | Standard - balanced quality |
24000 | High quality - default, recommended for most use cases |
44100 | CD quality - media production |
48000 | Professional - studio-grade audio |
Bit Rates (MP3 only)
| Rate | Quality |
|---|---|
32000 | Low - smallest file size |
64000 | Medium - good for speech |
96000 | Standard - balanced |
128000 | High - default, recommended |
192000 | Maximum - highest fidelity |
Example: High-fidelity MP3
JSON
{
"text": "Crystal clear audio at maximum quality.",
"voice_id": "rex",
"output_format": {
"codec": "mp3",
"sample_rate": 44100,
"bit_rate": 192000
}
}
Example: Telephony (μ-law)
JSON
{
"text": "Hello, thank you for calling. How can I help you today?",
"voice_id": "ara",
"output_format": {
"codec": "mulaw",
"sample_rate": 8000
}
}
Best Practices
Tips for getting the highest quality output from the TTS API.
Writing effective text
- Use natural punctuation. Commas, periods, and question marks guide pacing and intonation.
"Wait, really?"sounds more natural than"Wait really". - Add emotional context. Exclamation marks and question marks influence delivery -
"That's amazing!"sounds enthusiastic while"That's amazing."is matter-of-fact. - Break long content into paragraphs. Paragraph breaks create natural pauses and help the model maintain consistent quality across longer text.
- Keep requests under 4,096 characters. For longer content, split into logical segments (by paragraph or sentence) and concatenate the audio output.
Integrating with AI coding assistants
The Cloud Console playground includes ready-made agent instructions you can copy and paste into tools like Cursor, GitHub Copilot, or Windsurf. The instructions are pre-configured with your current voice and format settings - open the playground, tweak your settings, and copy the prompt to get a tailored integration guide for your coding agent.
Optimizing for production
- Proxy requests server-side. Never expose your API key in client-side code. Route TTS requests through your backend.
- Cache generated audio. If the same text is requested repeatedly, cache the audio bytes to save API calls and reduce latency.
- Match the format to the use case. Use
mulaworalawat 8 kHz for telephony;mp3at 24 kHz for web;wavat 44.1+ kHz for post-production. - Respect concurrent session limits. The streaming WebSocket endpoint allows up to 50 concurrent sessions per team. For high-throughput services, pool connections or queue requests to stay within this limit.
Browser Playback
To play TTS audio in the browser, proxy the request through your backend and use the Web Audio API or an <audio> element:
Javascript
// Client-side: fetch from your backend proxy, then play
async function speakText(text, voiceId = "eve") {
const response = await fetch("/api/tts", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ text, voice_id: voiceId }),
});
if (!response.ok) throw new Error("TTS request failed");
const blob = await response.blob();
const url = URL.createObjectURL(blob);
const audio = new Audio(url);
audio.addEventListener("ended", () => URL.revokeObjectURL(url));
await audio.play();
}
// Usage
await speakText("Hello from the browser!");
Never call the TTS API directly from the browser - this would expose your API key. Always proxy through your backend.
Browser gotchas
Safari returns Infinity for audio.duration on blob URLs. The loadedmetadata event fires but audio.duration is Infinity, breaking seek bars and time displays. Use AudioContext.decodeAudioData() instead:
Javascript
async function getAudioDuration(arrayBuffer) {
const AudioCtx = window.AudioContext || window.webkitAudioContext;
const ctx = new AudioCtx();
// Clone the buffer - decodeAudioData detaches the original
const decoded = await ctx.decodeAudioData(arrayBuffer.slice(0));
const durationMs = Math.round(decoded.duration * 1000);
await ctx.close();
return durationMs;
}
AudioContext must be created during a user gesture on Safari. Safari permanently suspends an AudioContext created outside a click/tap handler, with no way to resume it. Chrome is more lenient. Always create or resume the context in your button's click handler, before any await:
Javascript
// Create the AudioContext once, in a click handler
let audioCtx;
button.addEventListener("click", async () => {
// This MUST happen synchronously in the click handler for Safari
if (!audioCtx) audioCtx = new AudioContext();
if (audioCtx.state === "suspended") await audioCtx.resume();
// Now it's safe to fetch and play audio asynchronously
const response = await fetch("/api/tts", { /* ... */ });
const arrayBuffer = await response.arrayBuffer();
const decoded = await audioCtx.decodeAudioData(arrayBuffer);
const source = audioCtx.createBufferSource();
source.buffer = decoded;
source.connect(audioCtx.destination);
source.start();
});
Raw codecs (pcm, mulaw, alaw) are not playable in the browser. AudioContext.decodeAudioData() and <audio> elements only support container formats like MP3 and WAV. Use mp3 or wav for browser playback. If you're working with raw formats server-side (e.g., piping to telephony), estimate duration from byte count:
Javascript
// PCM = 16-bit LE (2 bytes/sample), mulaw/alaw = 8-bit (1 byte/sample)
const bytesPerSample = codec === "pcm" ? 2 : 1;
const durationMs = Math.round((byteLength / bytesPerSample / sampleRate) * 1000);
Revoke blob URLs to avoid memory leaks. Each URL.createObjectURL() call allocates memory that persists until explicitly freed. Revoke URLs when playback ends. For downloads, delay revocation so the browser finishes saving the file:
Javascript
// Playback: revoke when done
const url = URL.createObjectURL(blob);
const audio = new Audio(url);
audio.addEventListener("ended", () => URL.revokeObjectURL(url));
// Downloads: delay revocation
const downloadUrl = URL.createObjectURL(blob);
const a = document.createElement("a");
a.href = downloadUrl;
a.download = "speech.mp3";
a.click();
setTimeout(() => URL.revokeObjectURL(downloadUrl), 10_000);
Error Handling
| Status | Meaning | Action |
|---|---|---|
200 | Success | Audio bytes in the response body |
400 | Bad request | Check: text is non-empty, under 4,096 chars; codec and sample rate are valid |
401 | Unauthorized | API key is missing or invalid |
429 | Rate limited | Back off and retry with exponential delay |
503 | Service unavailable | TTS service is temporarily unavailable - retry |
500 | Server error | Retry with exponential backoff |
Retry with backoff
import os
import time
import requests
def generate_speech(text, voice_id="eve", max_retries=3):
for attempt in range(max_retries):
response = requests.post(
"https://api.x.ai/v1/tts",
headers={
"Authorization": f"Bearer {os.environ['XAI_API_KEY']}",
"Content-Type": "application/json",
},
json={"text": text, "voice_id": voice_id},
)
if response.ok:
return response.content
if response.status_code in (429, 500, 503):
wait = 2 ** attempt
time.sleep(wait)
continue
response.raise_for_status() # Non-retryable error
raise RuntimeError("Max retries exceeded")
Streaming TTS (WebSocket)
For real-time audio generation, open a WebSocket connection to the streaming TTS endpoint. Text is streamed in as deltas and audio is streamed back as base64-encoded chunks — ideal for interactive applications where you want audio to start playing before the full text is available.
Endpoint: wss://api.x.ai/v1/tts
Never expose your API key in client-side code. Always proxy WebSocket connections through your backend.
Connection
Open a WebSocket connection with optional query parameters to configure voice and audio format:
Text
GET /v1/tts?voice=eve&codec=mp3&sample_rate=24000&bit_rate=128000
Upgrade: websocket
Authorization: Bearer $XAI_API_KEY
All query parameters are optional:
| Parameter | Default | Accepted values |
|---|---|---|
voice | eve | ara, eve, leo, rex, sal |
codec | mp3 | mp3, wav, pcm, mulaw (or ulaw), alaw |
sample_rate | 24000 | 8000, 16000, 22050, 24000, 44100, 48000 |
bit_rate | 128000 | 32000, 64000, 96000, 128000, 192000 (MP3 only) |
An invalid voice, codec, or sample_rate is rejected before the WebSocket upgrade with an HTTP 400 or 404.
Client → Server Messages
Send text to the server as JSON text frames. Split your text across multiple text.delta messages, then signal the end of the utterance with text.done:
JSON
{"type": "text.delta", "delta": "Here is some text. "}
{"type": "text.delta", "delta": "More text follows."}
{"type": "text.done"}
| Event | Description |
|---|---|
text.delta | A chunk of text to synthesize. Individual deltas are capped at 50,000 characters. |
text.done | Signals the end of the current utterance. The server will finish generating audio and send audio.done. |
Server → Client Messages
The server responds with base64-encoded audio chunks and a completion event:
JSON
{"type": "audio.delta", "delta": "<base64-encoded audio bytes>"}
{"type": "audio.done", "trace_id": "uuid"}
{"type": "error", "message": "description"}
| Event | Description |
|---|---|
audio.delta | A chunk of base64-encoded audio in the codec specified at connection time. Decode and enqueue for playback. |
audio.done | All audio for the current utterance has been sent. Includes a trace_id for debugging. |
error | An error occurred. The message field contains a human-readable description. |
Multi-Utterance Sessions
The connection stays open after audio.done. You can immediately send another round of text.delta → text.done messages to synthesize additional text without reconnecting. This is useful for conversational UIs where you generate audio for each assistant response in sequence.
Quick Start
import asyncio
import base64
import os
import websockets
XAI_API_KEY = os.environ["XAI_API_KEY"]
async def stream_tts(text: str, voice: str = "eve", codec: str = "mp3"):
uri = f"wss://api.x.ai/v1/tts?voice={voice}&codec={codec}"
audio_chunks: list[bytes] = []
async with websockets.connect(
uri,
additional_headers={"Authorization": f"Bearer {XAI_API_KEY}"},
) as ws:
# Send text in one delta (or split across multiple)
await ws.send('{"type": "text.delta", "delta": ' + f'"{text}"' + "}")
await ws.send('{"type": "text.done"}')
# Receive audio chunks until done
async for message in ws:
import json
event = json.loads(message)
if event["type"] == "audio.delta":
audio_chunks.append(base64.b64decode(event["delta"]))
elif event["type"] == "audio.done":
print(f"Done — trace_id: {event['trace_id']}")
break
elif event["type"] == "error":
raise RuntimeError(event["message"])
audio = b"".join(audio_chunks)
with open(f"output.{codec}", "wb") as f:
f.write(audio)
print(f"Saved {len(audio):,} bytes to output.{codec}")
asyncio.run(stream_tts("Hello from the streaming TTS API!"))
Limits and Behavior
| Property | Value |
|---|---|
| Delta size | Individual text.delta messages capped at 50,000 characters |
| Concurrent sessions | 50 per team |
| Session permit TTL | 600 seconds |
| Moderation | Runs asynchronously on accumulated text after audio is sent (fail-open) |
| Billing | Recorded per session based on total input characters |
Related
- TTS Playground - Try voices and speech tags in your browser
- Create an API Key - Get started with the API
- Voice Overview - Overview of all xAI voice capabilities
- Voice Agent API - Real-time voice conversations via WebSocket
- API Reference - Full TTS endpoint specification
- List Voices - Programmatically discover available voices
Did you find this page helpful?