Text to Speech (Beta)

Convert text into spoken audio with a single API call. The xAI Text to Speech API produces natural, expressive speech with support for multiple voices, inline speech tags for fine-grained control over delivery, and a range of output formats - from high-fidelity MP3 to telephony-optimized μ-law.

Endpoint: POST https://api.x.ai/v1/tts

Beta: The Text to Speech API is currently in beta. Pricing and rate limits are subject to change when the API becomes generally available. See current pricing.

No API key needed to get started. Try the playground to hear every voice, experiment with speech tags, and generate audio right in your browser.


Quick Start

Generate speech from text with a single POST request:

import os
import requests

response = requests.post(
    "https://api.x.ai/v1/tts",
    headers={
        "Authorization": f"Bearer {os.environ['XAI_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "text": "Hello! Welcome to the xAI Text to Speech API.",
        "voice_id": "eve",
    },
)
response.raise_for_status()

with open("hello.mp3", "wb") as f:
    f.write(response.content)

print(f"Saved {len(response.content):,} bytes to hello.mp3")

The response body contains raw audio bytes. Save directly to a file or pipe to an audio player.

Don't have an API key yet? Create one in the console - it only takes a few seconds.


Request Body

Parameter | Type | Required | Description
text | string | Yes | The text to convert to speech. Maximum 4,096 characters. Supports speech tags.
voice_id | string | No | Voice to use for synthesis. Defaults to eve. See Voices.
output_format | object | No | Output format configuration. Defaults to MP3 at 24 kHz / 128 kbps. See Output Formats.

Example with all options

JSON

{
  "text": "Hello! This is a high-fidelity text to speech example.",
  "voice_id": "ara",
  "output_format": {
    "codec": "mp3",
    "sample_rate": 44100,
    "bit_rate": 192000
  }
}

Voices

Five voices are available, each with a distinct personality. Listen to samples and choose the best fit for your use case:

Voice | Tone | Description
eve | Energetic, upbeat | Default voice - engaging and enthusiastic
ara | Warm, friendly | Balanced and conversational
rex | Confident, clear | Professional and articulate - ideal for business
sal | Smooth, balanced | Versatile voice for a wide range of contexts
leo | Authoritative, strong | Commanding and decisive - great for instructional content

Voice IDs are case-insensitive - eve, Eve, and EVE all work. Preview all voices in the playground →

Choosing the right voice

  • eve - Great default for demos, announcements, and upbeat content
  • ara - Ideal for conversational interfaces, customer support, and warm narration
  • rex - Best for business presentations, corporate communications, and tutorials
  • sal - Versatile choice for balanced delivery across different content types
  • leo - Perfect for authoritative narration, instructions, and educational content

You can also list voices programmatically with the List voices endpoint:

import os
import requests

response = requests.get(
    "https://api.x.ai/v1/tts/voices",
    headers={"Authorization": f"Bearer {os.environ['XAI_API_KEY']}"},
)
for voice in response.json()["voices"]:
    print(f"{voice['voice_id']:5s}  {voice['name']}")

Supported Languages

The TTS API supports the following languages. The model automatically detects the language of the input text, so no explicit language parameter is needed.

Language Code | Language
en-US | English (United States)
en-GB | English (United Kingdom)
es-ES | Spanish (Spain)
es-MX | Spanish (Mexico)
fr-FR | French (France)
it-IT | Italian
ko-KR | Korean
pl-PL | Polish
tr-TR | Turkish
ar-AE | Arabic (United Arab Emirates)
bn-BD | Bengali (Bangladesh)
el-GR | Greek
hu-HU | Hungarian
ta-IN | Tamil (India)
te-IN | Telugu (India)
ur-PK | Urdu (Pakistan)
vi-VN | Vietnamese
zh-TW | Chinese (Taiwan)

Speech Tags

Add inline speech tags to your text for expressive delivery. There are two types of tags:

  • Inline tags [tag] — placed at a specific point in the text to produce a vocal expression (e.g. a laugh or pause)
  • Wrapping tags <tag>text</tag> — wrap a section of text to change how it is delivered (e.g. whispering, singing)

Inline Tags

Insert these where the expression should occur:

Category | Tags
Pauses | [pause] [long-pause] [hum-tune]
Laughter & crying | [laugh] [chuckle] [giggle] [cry]
Mouth sounds | [tsk] [tongue-click] [lip-smack]
Breathing | [breath] [inhale] [exhale] [sigh]

Wrapping Tags

Wrap text to change delivery style. Use an opening tag and a matching closing tag:

Category | Tags
Volume & intensity | <soft> <whisper> <loud> <build-intensity> <decrease-intensity>
Pitch & speed | <higher-pitch> <lower-pitch> <slow> <fast>
Vocal style | <sing-song> <singing> <laugh-speak> <emphasis>

Examples

import os
import requests

# Inline tags
response = requests.post(
    "https://api.x.ai/v1/tts",
    headers={
        "Authorization": f"Bearer {os.environ['XAI_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "text": "So I walked in and [pause] there it was. [laugh] I honestly could not believe it!",
        "voice_id": "eve",
    },
)
response.raise_for_status()

with open("expressive.mp3", "wb") as f:
    f.write(response.content)

# Wrapping tags
response = requests.post(
    "https://api.x.ai/v1/tts",
    headers={
        "Authorization": f"Bearer {os.environ['XAI_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "text": "I need to tell you something. <whisper>It is a secret.</whisper> Pretty cool, right?",
        "voice_id": "eve",
    },
)
response.raise_for_status()

with open("whisper.mp3", "wb") as f:
    f.write(response.content)

Tips for speech tags:

  • Place inline tags where the expression would naturally occur in conversation
  • Combine tags with punctuation — "Really? [laugh] That's incredible!" produces more natural results than stacking tags
  • Use [pause] or [long-pause] to add dramatic timing or let a thought land
  • Wrapping tags work best around complete phrases — <whisper>It is a secret.</whisper> reads more naturally than wrapping individual words
  • Combine styles for effect — <slow><soft>Goodnight, sleep well.</soft></slow>

Output Formats

Control the audio codec, sample rate, and bit rate with the output_format object. When omitted, the default is MP3 at 24 kHz / 128 kbps.

Codecs

Codec | Content-Type | Best for
mp3 | audio/mpeg | General use - wide compatibility, good compression
wav | audio/wav | Lossless audio - editing, post-production
pcm | audio/pcm | Raw audio - real-time processing pipelines
mulaw | audio/basic | Telephony (G.711 μ-law)
alaw | audio/alaw | Telephony (G.711 A-law)
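
The pcm codec returns headerless samples, which most players and editors cannot open directly. A minimal sketch of wrapping them in a WAV container with Python's standard wave module, assuming the output is 16-bit little-endian mono (the same layout the raw-codec duration estimate later on this page assumes):

```python
import wave

def pcm_to_wav(pcm_bytes: bytes, path: str, sample_rate: int = 24000) -> None:
    """Wrap raw PCM bytes in a WAV container so standard tools can open them.

    Assumes 16-bit little-endian mono samples; adjust the channel count
    and sample width if your pipeline uses a different layout.
    """
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)        # mono
        wf.setsampwidth(2)        # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)
```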

Sample Rates

Rate | Description
8000 | Narrowband - telephony
16000 | Wideband - speech recognition
22050 | Standard - balanced quality
24000 | High quality - default, recommended for most use cases
44100 | CD quality - media production
48000 | Professional - studio-grade audio

Bit Rates (MP3 only)

Rate | Quality
32000 | Low - smallest file size
64000 | Medium - good for speech
96000 | Standard - balanced
128000 | High - default, recommended
192000 | Maximum - highest fidelity

Example: High-fidelity MP3

JSON

{
  "text": "Crystal clear audio at maximum quality.",
  "voice_id": "rex",
  "output_format": {
    "codec": "mp3",
    "sample_rate": 44100,
    "bit_rate": 192000
  }
}

Example: Telephony (μ-law)

JSON

{
  "text": "Hello, thank you for calling. How can I help you today?",
  "voice_id": "ara",
  "output_format": {
    "codec": "mulaw",
    "sample_rate": 8000
  }
}

Best Practices

Tips for getting the highest quality output from the TTS API.

Writing effective text

  • Use natural punctuation. Commas, periods, and question marks guide pacing and intonation. "Wait, really?" sounds more natural than "Wait really".
  • Add emotional context. Exclamation marks and question marks influence delivery - "That's amazing!" sounds enthusiastic while "That's amazing." is matter-of-fact.
  • Break long content into paragraphs. Paragraph breaks create natural pauses and help the model maintain consistent quality across longer text.
  • Keep requests under 4,096 characters. For longer content, split into logical segments (by paragraph or sentence) and concatenate the audio output.
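
A rough sketch of the splitting step (the split_text helper is illustrative, not part of the API), preferring paragraph boundaries and falling back to sentence boundaries:

```python
import re

def split_text(text: str, max_len: int = 4096) -> list[str]:
    """Split text into API-sized segments at paragraph boundaries,
    falling back to sentence boundaries for oversized paragraphs.

    Simplifying assumption: no single sentence exceeds max_len.
    """
    segments: list[str] = []
    for para in text.split("\n\n"):
        if len(para) <= max_len:
            segments.append(para)
            continue
        current = ""
        for sentence in re.split(r"(?<=[.!?])\s+", para):
            if current and len(current) + 1 + len(sentence) > max_len:
                segments.append(current)
                current = sentence
            else:
                current = f"{current} {sentence}".strip()
        if current:
            segments.append(current)
    return segments
```

Synthesize each segment with a separate request, then concatenate the audio. Most players handle naively concatenated MP3 output generated with identical format settings, though a fully gap-free result may call for an audio library.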

Integrating with AI coding assistants

The Cloud Console playground includes ready-made agent instructions you can copy and paste into tools like Cursor, GitHub Copilot, or Windsurf. The instructions are pre-configured with your current voice and format settings - open the playground, tweak your settings, and copy the prompt to get a tailored integration guide for your coding agent.

Optimizing for production

  • Proxy requests server-side. Never expose your API key in client-side code. Route TTS requests through your backend.
  • Cache generated audio. If the same text is requested repeatedly, cache the audio bytes to save API calls and reduce latency.
  • Match the format to the use case. Use mulaw or alaw at 8 kHz for telephony; mp3 at 24 kHz for web; wav at 44.1+ kHz for post-production.
  • Respect concurrent session limits. The streaming WebSocket endpoint allows up to 50 concurrent sessions per team. For high-throughput services, pool connections or queue requests to stay within this limit.

Browser Playback

To play TTS audio in the browser, proxy the request through your backend and use the Web Audio API or an <audio> element:

Javascript

// Client-side: fetch from your backend proxy, then play
async function speakText(text, voiceId = "eve") {
  const response = await fetch("/api/tts", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text, voice_id: voiceId }),
  });

  if (!response.ok) throw new Error("TTS request failed");

  const blob = await response.blob();
  const url = URL.createObjectURL(blob);

  const audio = new Audio(url);
  audio.addEventListener("ended", () => URL.revokeObjectURL(url));
  await audio.play();
}

// Usage
await speakText("Hello from the browser!");

Never call the TTS API directly from the browser - this would expose your API key. Always proxy through your backend.

Browser gotchas

Safari returns Infinity for audio.duration on blob URLs. The loadedmetadata event fires but audio.duration is Infinity, breaking seek bars and time displays. Use AudioContext.decodeAudioData() instead:

Javascript

async function getAudioDuration(arrayBuffer) {
  const AudioCtx = window.AudioContext || window.webkitAudioContext;
  const ctx = new AudioCtx();
  // Clone the buffer - decodeAudioData detaches the original
  const decoded = await ctx.decodeAudioData(arrayBuffer.slice(0));
  const durationMs = Math.round(decoded.duration * 1000);
  await ctx.close();
  return durationMs;
}

AudioContext must be created during a user gesture on Safari. Safari permanently suspends an AudioContext created outside a click/tap handler, with no way to resume it. Chrome is more lenient. Always create or resume the context in your button's click handler, before any await:

Javascript

// Create the AudioContext once, in a click handler
let audioCtx;
button.addEventListener("click", async () => {
  // This MUST happen synchronously in the click handler for Safari
  if (!audioCtx) audioCtx = new AudioContext();
  if (audioCtx.state === "suspended") await audioCtx.resume();

  // Now it's safe to fetch and play audio asynchronously
  const response = await fetch("/api/tts", { /* ... */ });
  const arrayBuffer = await response.arrayBuffer();
  const decoded = await audioCtx.decodeAudioData(arrayBuffer);
  const source = audioCtx.createBufferSource();
  source.buffer = decoded;
  source.connect(audioCtx.destination);
  source.start();
});

Raw codecs (pcm, mulaw, alaw) are not playable in the browser. AudioContext.decodeAudioData() and <audio> elements only support container formats like MP3 and WAV. Use mp3 or wav for browser playback. If you're working with raw formats server-side (e.g., piping to telephony), estimate duration from byte count:

Javascript

// PCM = 16-bit LE (2 bytes/sample), mulaw/alaw = 8-bit (1 byte/sample)
const bytesPerSample = codec === "pcm" ? 2 : 1;
const durationMs = Math.round((byteLength / bytesPerSample / sampleRate) * 1000);
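
Server-side, the same byte-count arithmetic as a small Python helper (estimate_duration_ms is illustrative):

```python
def estimate_duration_ms(byte_length: int, codec: str, sample_rate: int) -> int:
    """Estimate raw-audio duration: pcm is 16-bit (2 bytes/sample),
    mulaw and alaw are 8-bit (1 byte/sample)."""
    bytes_per_sample = 2 if codec == "pcm" else 1
    return round(byte_length / bytes_per_sample / sample_rate * 1000)
```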

Revoke blob URLs to avoid memory leaks. Each URL.createObjectURL() call allocates memory that persists until explicitly freed. Revoke URLs when playback ends. For downloads, delay revocation so the browser finishes saving the file:

Javascript

// Playback: revoke when done
const url = URL.createObjectURL(blob);
const audio = new Audio(url);
audio.addEventListener("ended", () => URL.revokeObjectURL(url));

// Downloads: delay revocation
const downloadUrl = URL.createObjectURL(blob);
const a = document.createElement("a");
a.href = downloadUrl;
a.download = "speech.mp3";
a.click();
setTimeout(() => URL.revokeObjectURL(downloadUrl), 10_000);

Error Handling

Status | Meaning | Action
200 | Success | Audio bytes in the response body
400 | Bad request | Check that text is non-empty and under 4,096 characters, and that the codec and sample rate are valid
401 | Unauthorized | API key is missing or invalid
429 | Rate limited | Back off and retry with exponential delay
500 | Server error | Retry with exponential backoff
503 | Service unavailable | TTS service is temporarily unavailable - retry

Retry with backoff

import os
import time
import requests

def generate_speech(text, voice_id="eve", max_retries=3):
    for attempt in range(max_retries):
        response = requests.post(
            "https://api.x.ai/v1/tts",
            headers={
                "Authorization": f"Bearer {os.environ['XAI_API_KEY']}",
                "Content-Type": "application/json",
            },
            json={"text": text, "voice_id": voice_id},
        )
        if response.ok:
            return response.content
        if response.status_code in (429, 500, 503):
            wait = 2 ** attempt
            time.sleep(wait)
            continue
        response.raise_for_status()  # Non-retryable error
    raise RuntimeError("Max retries exceeded")

Streaming TTS (WebSocket)

For real-time audio generation, open a WebSocket connection to the streaming TTS endpoint. Text is streamed in as deltas and audio is streamed back as base64-encoded chunks — ideal for interactive applications where you want audio to start playing before the full text is available.

Endpoint: wss://api.x.ai/v1/tts

Never expose your API key in client-side code. Always proxy WebSocket connections through your backend.


Connection

Open a WebSocket connection with optional query parameters to configure voice and audio format:

Text

GET /v1/tts?voice=eve&codec=mp3&sample_rate=24000&bit_rate=128000
Upgrade: websocket
Authorization: Bearer $XAI_API_KEY

All query parameters are optional:

Parameter | Default | Accepted values
voice | eve | ara, eve, leo, rex, sal
codec | mp3 | mp3, wav, pcm, mulaw (or ulaw), alaw
sample_rate | 24000 | 8000, 16000, 22050, 24000, 44100, 48000
bit_rate | 128000 | 32000, 64000, 96000, 128000, 192000 (MP3 only)

An invalid voice, codec, or sample_rate is rejected before the WebSocket upgrade with an HTTP 400 or 404.


Client → Server Messages

Send text to the server as JSON text frames. Split your text across multiple text.delta messages, then signal the end of the utterance with text.done:

JSON

{"type": "text.delta", "delta": "Here is some text. "}
{"type": "text.delta", "delta": "More text follows."}
{"type": "text.done"}

Event | Description
text.delta | A chunk of text to synthesize. Individual deltas are capped at 50,000 characters.
text.done | Signals the end of the current utterance. The server will finish generating audio and send audio.done.

Server → Client Messages

The server responds with base64-encoded audio chunks and a completion event:

JSON

{"type": "audio.delta", "delta": "<base64-encoded audio bytes>"}
{"type": "audio.done", "trace_id": "uuid"}
{"type": "error", "message": "description"}

Event | Description
audio.delta | A chunk of base64-encoded audio in the codec specified at connection time. Decode and enqueue for playback.
audio.done | All audio for the current utterance has been sent. Includes a trace_id for debugging.
error | An error occurred. The message field contains a human-readable description.

Multi-Utterance Sessions

The connection stays open after audio.done. You can immediately send another round of text.delta messages followed by text.done to synthesize additional text without reconnecting. This is useful for conversational UIs where you generate audio for each assistant response in sequence.


Quick Start

import asyncio
import base64
import json
import os

import websockets

XAI_API_KEY = os.environ["XAI_API_KEY"]

async def stream_tts(text: str, voice: str = "eve", codec: str = "mp3"):
    uri = f"wss://api.x.ai/v1/tts?voice={voice}&codec={codec}"
    audio_chunks: list[bytes] = []

    async with websockets.connect(
        uri,
        additional_headers={"Authorization": f"Bearer {XAI_API_KEY}"},
    ) as ws:
        # Send text in one delta (or split across multiple);
        # json.dumps handles quoting and escaping
        await ws.send(json.dumps({"type": "text.delta", "delta": text}))
        await ws.send(json.dumps({"type": "text.done"}))

        # Receive audio chunks until done
        async for message in ws:
            event = json.loads(message)

            if event["type"] == "audio.delta":
                audio_chunks.append(base64.b64decode(event["delta"]))
            elif event["type"] == "audio.done":
                print(f"Done (trace_id: {event['trace_id']})")
                break
            elif event["type"] == "error":
                raise RuntimeError(event["message"])

    audio = b"".join(audio_chunks)
    with open(f"output.{codec}", "wb") as f:
        f.write(audio)
    print(f"Saved {len(audio):,} bytes to output.{codec}")

asyncio.run(stream_tts("Hello from the streaming TTS API!"))

Limits and Behavior

Property | Value
Delta size | Individual text.delta messages capped at 50,000 characters
Concurrent sessions | 50 per team
Session permit TTL | 600 seconds
Moderation | Runs asynchronously on accumulated text after audio is sent (fail-open)
Billing | Recorded per session based on total input characters
