Voice Overview
The xAI Voice APIs offer a range of powerful voice capabilities, all powered by Grok, with enterprise-grade reliability and sub-second latency.
Voice Agent API
Real-time speech-to-speech conversations with tool use, powered by Grok.
Text to Speech
Generate speech with 5 expressive voices, speech tags, and telephony codecs.
Speech to Text
Transcribe audio to text in 25 languages with batch and streaming modes.
Voice Agent API
Build real-time, speech-to-speech voice agents over WebSockets, with low-latency turn-taking and tool use. For client-side apps, use Ephemeral Tokens to connect securely without exposing your API key.
import asyncio
import json
import os

import websockets


async def voice_agent():
    async with websockets.connect(
        "wss://api.x.ai/v1/realtime?model=grok-voice-think-fast-1.0",
        additional_headers={"Authorization": f"Bearer {os.environ['XAI_API_KEY']}"},
    ) as ws:
        # Configure voice and enable tools
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "voice": "eve",
                "instructions": "You are a helpful customer support agent.",
                "turn_detection": {"type": "server_vad"},
                "tools": [{"type": "web_search"}],
            },
        }))

        # Stream audio and receive responses
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.output_audio.delta":
                # Play audio: base64.b64decode(event["delta"])
                pass


asyncio.run(voice_agent())
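The receive loop above leaves audio handling as a comment. The following is a minimal sketch of that step, assuming each `response.output_audio.delta` event carries base64-encoded audio in its `delta` field, as the comment suggests. The helper name `collect_audio` is hypothetical and the sample events are hand-made, not real API output.

```python
import base64
import json


def collect_audio(messages):
    """Concatenate decoded audio from a sequence of raw JSON event strings.

    Assumes each "response.output_audio.delta" event has a base64 "delta"
    field; every other event type is ignored.
    """
    chunks = []
    for message in messages:
        event = json.loads(message)
        if event.get("type") == "response.output_audio.delta":
            chunks.append(base64.b64decode(event["delta"]))
    return b"".join(chunks)


# Hand-made events for illustration only.
sample = [
    json.dumps({
        "type": "response.output_audio.delta",
        "delta": base64.b64encode(b"\x00\x01").decode("ascii"),
    }),
    json.dumps({"type": "some.other.event"}),
]
audio = collect_audio(sample)
```

In a live session the same decoding would run inside the `async for` loop, with each decoded chunk handed to an audio player or written to a buffer.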
Demo Apps: Web Agent · Twilio Phone Agent · WebRTC Agent · iOS Tester App
Text to Speech
Convert text to spoken audio in 5 expressive voices. The API supports inline speech tags (laughter, whispers, pauses) and output formats ranging from high-fidelity MP3 to telephony μ-law, and is available as unary requests or WebSocket streaming.
import os
import requests

response = requests.post(
    "https://api.x.ai/v1/tts",
    headers={
        "Authorization": f"Bearer {os.environ['XAI_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "text": "Welcome to xAI. How can I help you today?",
        "voice_id": "eve",
        "language": "en",
    },
)

with open("welcome.mp3", "wb") as f:
    f.write(response.content)
Real World Examples: LiveKit · Pipecat
Speech to Text
Transcribe audio files in a single call or stream over WebSocket. The API supports 12 audio formats, word-level timestamps, multichannel audio, speaker diarization, and 25 languages.
import os
import requests

# Open the file in a with-block so the handle is closed after the upload.
with open("recording.mp3", "rb") as audio:
    response = requests.post(
        "https://api.x.ai/v1/stt",
        headers={"Authorization": f"Bearer {os.environ['XAI_API_KEY']}"},
        files={"file": ("recording.mp3", audio, "audio/mpeg")},
    )

print(response.json()["text"])
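To reuse the request across several files, it can be wrapped in a small function that also checks the HTTP status before reading the transcript. This is a sketch, not part of any SDK: the function name `transcribe_file` and its injectable `post` parameter are hypothetical, while the endpoint and the `text` response field are taken from the example above.

```python
import os

STT_URL = "https://api.x.ai/v1/stt"  # endpoint from the example above


def transcribe_file(path, api_key, mime_type="audio/mpeg", post=None):
    """Upload one audio file and return the transcript text.

    `post` is a hypothetical hook so the function can be exercised without
    network access; it defaults to requests.post.
    """
    if post is None:
        import requests  # imported lazily so a stub can replace it in tests
        post = requests.post

    with open(path, "rb") as audio:  # handle is closed even if the call fails
        response = post(
            STT_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": (os.path.basename(path), audio, mime_type)},
        )
    response.raise_for_status()  # surface HTTP errors before parsing the body
    return response.json()["text"]
```

A call such as `transcribe_file("recording.mp3", os.environ["XAI_API_KEY"])` would then replace the inline request.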
Real World Examples: Voximplant
Quick Start: Custom Voices
Clone a voice from a short reference clip, then use the resulting voice_id anywhere a built-in voice works:
import os
import requests

# 1. Create a custom voice from a reference audio clip (max 120s).
with open("reference.wav", "rb") as f:
    create = requests.post(
        "https://api.x.ai/v1/custom-voices",
        headers={"Authorization": f"Bearer {os.environ['XAI_API_KEY']}"},
        files={"file": ("reference.wav", f, "audio/wav")},
        data={"name": "Friendly Narrator", "language": "en"},
    )
voice_id = create.json()["voice_id"]

# 2. Use the custom voice for TTS.
speech = requests.post(
    "https://api.x.ai/v1/tts",
    headers={
        "Authorization": f"Bearer {os.environ['XAI_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "text": "Hello! This is my custom voice.",
        "voice_id": voice_id,
        "language": "en",
    },
)

with open("custom.mp3", "wb") as f:
    f.write(speech.content)
The custom voice_id also works with the streaming TTS WebSocket and the Voice Agent realtime API. See the Custom Voices guide for the full API.
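As a rough illustration of the Voice Agent case, the sketch below builds a `session.update` message with a custom voice. It assumes the custom ID goes in the same `voice` field that the built-in names use in the Voice Agent example; this page does not confirm that placement, so check the Custom Voices guide. The helper name `build_session_update` and the ID `my-custom-voice-id` are placeholders.

```python
import json


def build_session_update(voice, instructions):
    """Return a session.update message as a JSON string.

    Mirrors the message shape from the Voice Agent example. Passing a custom
    voice_id in the "voice" field is an assumption, not confirmed here.
    """
    return json.dumps({
        "type": "session.update",
        "session": {
            "voice": voice,
            "instructions": instructions,
            "turn_detection": {"type": "server_vad"},
        },
    })


# Placeholder ID; a real one would come from the custom-voices response.
message = build_session_update("my-custom-voice-id", "You are a friendly narrator.")
```

The resulting string can be sent over the open WebSocket with `await ws.send(message)`.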
Voices
When using the Voice Agent API or Text to Speech, you can choose between 5 distinct voices. Each has its own personality and tone, so pick the one that best fits your application — from upbeat and conversational to authoritative and instructional.
| Voice | Type | Tone | Description |
|---|---|---|---|
| eve | Female | Energetic, upbeat | Default voice, engaging and enthusiastic |
| ara | Female | Warm, friendly | Balanced and conversational |
| rex | Male | Confident, clear | Professional and articulate, ideal for business |
| sal | Neutral | Smooth, balanced | Versatile voice suitable for various contexts |
| leo | Male | Authoritative, strong | Decisive and commanding, suitable for instructional content |
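If an application lets users choose a voice, the table can be mirrored in a small lookup. This is only a local convenience sketch: the names `VOICES`, `DEFAULT_VOICE`, and `voices_of_type` are hypothetical and not part of the API, and the data is copied from the table above.

```python
# Built-in voices, copied from the table above.
VOICES = {
    "eve": {"type": "Female", "tone": "Energetic, upbeat"},
    "ara": {"type": "Female", "tone": "Warm, friendly"},
    "rex": {"type": "Male", "tone": "Confident, clear"},
    "sal": {"type": "Neutral", "tone": "Smooth, balanced"},
    "leo": {"type": "Male", "tone": "Authoritative, strong"},
}

DEFAULT_VOICE = "eve"  # listed as the default voice in the table


def voices_of_type(voice_type):
    """Return the built-in voice names whose type matches, in table order."""
    return [name for name, info in VOICES.items() if info["type"] == voice_type]


male_voices = voices_of_type("Male")
```

The lookup covers only the built-in voices, so a custom `voice_id` should be passed through as-is rather than checked against it.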
Enterprise Compliance & Security
The xAI Voice APIs are built for production workloads with strict security and compliance requirements. All audio data is processed in real time and never stored or used for training.
Last updated: April 26, 2026