Voice Overview

The xAI Voice APIs cover real-time voice agents, text to speech, and speech to text, all powered by Grok, with enterprise-grade reliability and sub-second latency.

Voice Agent API

Real-time speech-to-speech conversations with tool use, powered by Grok.

Latency: Sub-second
Realtime: $3.00 / hour
Endpoint: /v1/realtime

Text to Speech

Generate speech with 5 expressive voices, speech tags, and telephony codecs.

Voices: 5 expressive
Pricing: $15.00 / 1M chars
Endpoint: /v1/tts

Speech to Text

Transcribe audio to text in 25 languages with batch and streaming modes.

Languages: 25
Pricing: $0.20 / hour*
Endpoint: /v1/stt
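
As a rough budgeting aid, the list prices above can be combined into a small estimator. This is only a sketch: the function and constant names are hypothetical, it assumes simple linear pricing with no minimums or billing increments, and the asterisk on the Speech to Text price refers to terms that are not shown in this overview.

```python
from decimal import Decimal

# List prices copied from the overview above (USD).
REALTIME_PER_HOUR = Decimal("3.00")        # Voice Agent API
TTS_PER_MILLION_CHARS = Decimal("15.00")   # Text to Speech
STT_PER_HOUR = Decimal("0.20")             # Speech to Text (listed with an asterisk)


def estimate_cost(realtime_hours=0, tts_chars=0, stt_hours=0):
    """Return an estimated USD cost as a Decimal rounded to cents.

    Assumes linear pricing; actual billing rules may differ.
    """
    total = (
        REALTIME_PER_HOUR * Decimal(str(realtime_hours))
        + TTS_PER_MILLION_CHARS * Decimal(tts_chars) / Decimal(1_000_000)
        + STT_PER_HOUR * Decimal(str(stt_hours))
    )
    return total.quantize(Decimal("0.01"))


# 10 h of realtime, 2M TTS characters, 50 h of transcription
print(estimate_cost(realtime_hours=10, tts_chars=2_000_000, stt_hours=50))  # prints 70.00
```

Decimal arithmetic is used here so that per-character prices do not pick up binary floating-point error.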

Voice Agent API

Build real-time, speech-to-speech voice agents over WebSockets, with low-latency turn-taking and tool use. For client-side apps, use Ephemeral Tokens to connect securely without exposing your API key.

import asyncio
import json
import os
import websockets

async def voice_agent():
    async with websockets.connect(
        "wss://api.x.ai/v1/realtime?model=grok-voice-think-fast-1.0",
        additional_headers={"Authorization": f"Bearer {os.environ['XAI_API_KEY']}"}
    ) as ws:
        # Configure voice and enable tools
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "voice": "eve",
                "instructions": "You are a helpful customer support agent.",
                "turn_detection": {"type": "server_vad"},
                "tools": [{"type": "web_search"}]
            }
        }))
        
        # Stream audio and receive responses
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.output_audio.delta":
                # Play audio: base64.b64decode(event["delta"])
                pass

asyncio.run(voice_agent())
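
The placeholder in the receive loop above can be filled in with a small helper that decodes each audio delta and accumulates the bytes. The following is a minimal sketch, assuming only what the example shows: events of type response.output_audio.delta carry base64 audio in their "delta" field. The helper name collect_audio and the non-audio event in the usage lines are hypothetical, and the format of the decoded audio is not specified in this overview.

```python
import base64
import json


def collect_audio(messages):
    """Concatenate the audio carried by response.output_audio.delta events.

    `messages` is any iterable of JSON strings, such as frames read from
    the realtime WebSocket. Other event types are ignored.
    """
    audio = bytearray()
    for message in messages:
        event = json.loads(message)
        if event.get("type") == "response.output_audio.delta":
            audio.extend(base64.b64decode(event["delta"]))
    return bytes(audio)


# Hand-built messages standing in for WebSocket frames.
frames = [
    json.dumps({"type": "response.output_audio.delta",
                "delta": base64.b64encode(b"\x00\x01").decode("ascii")}),
    json.dumps({"type": "example.other_event"}),  # hypothetical non-audio event
    json.dumps({"type": "response.output_audio.delta",
                "delta": base64.b64encode(b"\x02\x03").decode("ascii")}),
]
audio_bytes = collect_audio(frames)
```

In a live session the same decode step would run inside the `async for` loop, with each decoded chunk handed to an audio player instead of being buffered.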

Demo Apps: Web Agent · Twilio Phone Agent · WebRTC Agent · iOS Tester App

Text to Speech

Convert text to spoken audio in 5 expressive voices. The API supports inline speech tags (laughter, whispers, pauses) and output formats ranging from high-fidelity MP3 to telephony μ-law, through either unary requests or WebSocket streaming.

import os
import requests

response = requests.post(
    "https://api.x.ai/v1/tts",
    headers={
        "Authorization": f"Bearer {os.environ['XAI_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "text": "Welcome to xAI. How can I help you today?",
        "voice_id": "eve",
        "language": "en",
    },
)
response.raise_for_status()  # Fail early instead of saving an error body as audio

with open("welcome.mp3", "wb") as f:
    f.write(response.content)

Real World Examples: LiveKit · Pipecat

Speech to Text

Transcribe audio files in a single call or stream over WebSocket. The API supports 12 audio formats, word-level timestamps, multichannel audio, speaker diarization, and 25 languages.

import os
import requests

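# Open the file in a context manager so the handle is closed after the upload.
with open("recording.mp3", "rb") as audio:
    response = requests.post(
        "https://api.x.ai/v1/stt",
        headers={"Authorization": f"Bearer {os.environ['XAI_API_KEY']}"},
        files={"file": ("recording.mp3", audio, "audio/mpeg")},
    )
response.raise_for_status()  # Surface HTTP errors before reading the JSON body

print(response.json()["text"])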
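
The example above hardcodes the filename and MIME type of the upload. When transcribing arbitrary files, those two values can be derived from the path. This is a sketch using only the Python standard library; the helper name upload_field is hypothetical, and which of the guessed MIME types the endpoint accepts is not stated here.

```python
import mimetypes
import os


def upload_field(path):
    """Return (filename, mime_type) for the multipart "file" part.

    Falls back to application/octet-stream when the extension is unknown.
    """
    mime, _ = mimetypes.guess_type(path)
    return os.path.basename(path), mime or "application/octet-stream"


name, mime = upload_field("calls/recording.mp3")
```

The pair would then be used as `files={"file": (name, audio, mime)}`, where `audio` is the open file handle.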

Real World Examples: Voximplant


Quick Start: Custom Voices

Clone a voice from a short reference clip, then use the resulting voice_id anywhere a built-in voice works:

import os
import requests

# 1. Create a custom voice from a reference audio clip (max 120s).
with open("reference.wav", "rb") as f:
    create = requests.post(
        "https://api.x.ai/v1/custom-voices",
        headers={"Authorization": f"Bearer {os.environ['XAI_API_KEY']}"},
        files={"file": ("reference.wav", f, "audio/wav")},
        data={"name": "Friendly Narrator", "language": "en"},
    )
create.raise_for_status()  # Stop here if the voice could not be created
voice_id = create.json()["voice_id"]

# 2. Use the custom voice for TTS.
speech = requests.post(
    "https://api.x.ai/v1/tts",
    headers={
        "Authorization": f"Bearer {os.environ['XAI_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "text": "Hello! This is my custom voice.",
        "voice_id": voice_id,
        "language": "en",
    },
)
speech.raise_for_status()  # Avoid writing an error response to custom.mp3
with open("custom.mp3", "wb") as f:
    f.write(speech.content)

The custom voice_id also works with the streaming TTS WebSocket and the Voice Agent realtime API. See the Custom Voices guide for the full API.
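
Because the Quick Start caps the reference clip at 120 seconds, it can help to check the clip locally before uploading. The sketch below uses Python's standard wave module and therefore assumes an uncompressed WAV file; the function names are hypothetical, and how the service itself measures or enforces the limit is not described here.

```python
import wave

MAX_REFERENCE_SECONDS = 120  # limit stated in the Quick Start comment


def wav_duration_seconds(path):
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_clip(path):
    """Return the clip duration, or raise ValueError if it exceeds the limit."""
    duration = wav_duration_seconds(path)
    if duration > MAX_REFERENCE_SECONDS:
        raise ValueError(
            f"{path} is {duration:.1f}s long; the limit is {MAX_REFERENCE_SECONDS}s"
        )
    return duration
```

Calling `check_reference_clip("reference.wav")` before the upload step turns an over-length clip into a local error rather than a failed request.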


Voices

When using the Voice Agent API or Text to Speech, you can choose between 5 distinct voices. Each has its own personality and tone, ranging from upbeat and conversational to authoritative and instructional, so pick the one that best fits your application.

| Voice | Type | Tone | Description |
| --- | --- | --- | --- |
| eve | Female | Energetic, upbeat | Default voice, engaging and enthusiastic |
| ara | Female | Warm, friendly | Balanced and conversational |
| rex | Male | Confident, clear | Professional and articulate, ideal for business |
| sal | Neutral | Smooth, balanced | Versatile voice suitable for various contexts |
| leo | Male | Authoritative, strong | Decisive and commanding, suitable for instructional content |
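
For applications that let users choose a voice, the table above can be mirrored as a small lookup. This is an illustrative sketch only: the names VOICES, DEFAULT_VOICE, voices_of_type, and is_builtin_voice are hypothetical and not part of the API, and custom voice IDs are deliberately not covered by the built-in check.

```python
# Built-in voices copied from the table above: voice id -> (type, tone).
VOICES = {
    "eve": ("Female", "Energetic, upbeat"),
    "ara": ("Female", "Warm, friendly"),
    "rex": ("Male", "Confident, clear"),
    "sal": ("Neutral", "Smooth, balanced"),
    "leo": ("Male", "Authoritative, strong"),
}
DEFAULT_VOICE = "eve"  # the table lists eve as the default voice


def voices_of_type(voice_type):
    """Return built-in voice ids whose type matches, case-insensitively."""
    wanted = voice_type.lower()
    return [vid for vid, (vtype, _tone) in VOICES.items() if vtype.lower() == wanted]


def is_builtin_voice(voice_id):
    """True for the five built-in voices; custom voice ids return False."""
    return voice_id in VOICES
```

A voice picker could call `voices_of_type("female")` to populate its options and fall back to `DEFAULT_VOICE` when the user makes no choice.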

Enterprise Compliance & Security

The xAI Voice APIs are built for production workloads with strict security and compliance requirements. All audio data is processed in real time and never stored or used for training.

SOC 2 Type II: Audited controls for security, availability, and confidentiality
HIPAA Eligible: BAA available for healthcare applications handling PHI
GDPR Compliant: Data processing agreements and EU data residency options
Data Residency: Regional processing for compliance requirements
High Availability: Multi-region infrastructure with custom SLAs for enterprise workloads
SSO & RBAC: SAML SSO, role-based access, and audit logging

Last updated: April 26, 2026