Voice Overview

xAI offers two voice APIs: the Text to Speech API for converting text into spoken audio, and the Voice Agent API for real-time, interactive voice conversations with Grok.

Text to Speech API

Convert text into spoken audio with a single API call. Choose from multiple voices, add inline speech tags for expressive delivery, and output in formats ranging from high-fidelity MP3 to telephony-optimized μ-law.

Endpoint: POST https://api.x.ai/v1/tts
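As a rough illustration of the single-call pattern, the sketch below builds (but does not send) a POST request for the endpoint above using only the Python standard library. The endpoint URL and the `Eve` default voice come from this page; the JSON field names `text` and `voice`, the Bearer authorization header, and the helper name `build_tts_request` are assumptions made for illustration, so check them against the API reference before relying on them.

```python
import json
import urllib.request

TTS_URL = "https://api.x.ai/v1/tts"  # endpoint as documented on this page


def build_tts_request(text: str, api_key: str, voice: str = "Eve") -> urllib.request.Request:
    """Build (but do not send) a POST request for the TTS endpoint.

    ASSUMPTIONS: the body fields "text" and "voice" and the Bearer
    authorization scheme are illustrative, not confirmed by this page.
    """
    body = json.dumps({"text": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        TTS_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


req = build_tts_request("Hello from Grok.", api_key="YOUR_API_KEY")
# Sending is left out on purpose: urllib.request.urlopen(req) would perform
# the call, and the shape of the response is also an assumption here.
```

Building the request separately from sending it keeps the payload easy to inspect and unit test without network access or a real key.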

Use Cases:

Narration for audiobooks, articles, and documentation
Dynamic audio content for apps and games
Voiceovers for videos and presentations
Accessibility - read content aloud for users with visual impairments

Voice Agent API

Build real-time voice applications with the Grok Voice Agent API. Create interactive voice conversations with Grok models via WebSocket for voice assistants, phone agents, and interactive voice applications.

WebSocket Endpoint: wss://api.x.ai/v1/realtime

The Voice Agent API is available only in the us-east-1 region.

Use Cases:

Voice assistants for web and mobile
AI-powered phone systems with Twilio
Real-time customer support
Interactive Voice Response (IVR) systems

Enterprise Ready

The Voice Agent API is optimized for enterprise use cases across customer support, medical, legal, finance, insurance, and more.

Telephony - Connect to platforms like Twilio, Vonage, and other SIP providers
Tool Calling - CRMs, calendars, ticketing systems, databases, and custom APIs
Multilingual - Serve global customers in their native language with natural accents
Low Latency - Real-time responses for natural, human-like conversations
Accuracy - Precise transcription and understanding of critical information:
- Industry-specific terminology including medical, legal, and financial vocabulary
- Email addresses, dates, and alphanumeric codes
- Names, addresses, and phone numbers

Tool Calling

Extend your voice agent's capabilities with powerful built-in tools that execute during conversations:

Web Search - Real-time internet search for current information, news, and facts
X Search - Search posts, trends, and discussions from X
Collections - RAG-powered search over your uploaded documents and knowledge bases
Custom Functions - Define your own tools with JSON schemas for booking, lookups, calculations, and more

Tools are called automatically based on conversation context. Your voice agent can search the web, query your documents, and execute custom business logic - all while maintaining a natural conversation flow.
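To make the Custom Functions idea concrete, here is a minimal sketch of a tool described with a JSON schema, plus a local dispatcher that runs the matching handler when the agent asks for it. The `parameters` object is ordinary JSON Schema; the surrounding keys (`type`, `name`, `description`), the `check_availability` tool, and the `dispatch_tool_call` helper are hypothetical and only illustrate the pattern.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical tool definition. Only the "parameters" value is standard
# JSON Schema; the wrapper keys are an assumption about the API's format.
CHECK_AVAILABILITY_TOOL = {
    "type": "function",
    "name": "check_availability",
    "description": "Check whether an appointment slot is free.",
    "parameters": {
        "type": "object",
        "properties": {
            "date": {"type": "string", "description": "ISO 8601 date, e.g. 2025-01-31"},
            "time": {"type": "string", "description": "24-hour time, e.g. 14:30"},
        },
        "required": ["date", "time"],
    },
}

# Stand-in for a real calendar backend.
BOOKED_SLOTS = {("2025-01-31", "14:30")}


def check_availability(date: str, time: str) -> Dict[str, Any]:
    """Report whether the requested slot is still free."""
    return {"date": date, "time": time, "available": (date, time) not in BOOKED_SLOTS}


HANDLERS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "check_availability": check_availability,
}


def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Run the named handler with JSON-encoded arguments and return JSON."""
    handler = HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    arguments = json.loads(arguments_json)
    return json.dumps(handler(**arguments))


result = dispatch_tool_call("check_availability", '{"date": "2025-02-01", "time": "09:00"}')
```

Returning an error object for unknown tool names, rather than raising, lets the agent recover and explain the problem to the caller mid-conversation.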

Shared Capabilities

The following capabilities apply to both the Text to Speech API and the Voice Agent API.

Voices

Choose from 5 distinct voices, each with unique characteristics suited to different applications:

Voice	Type	Tone	Description
`Eve`	Female	Energetic, upbeat	Default voice, engaging and enthusiastic
`Ara`	Female	Warm, friendly	Balanced and conversational
`Rex`	Male	Confident, clear	Professional and articulate, ideal for business
`Sal`	Neutral	Smooth, balanced	Versatile voice suitable for various contexts
`Leo`	Male	Authoritative, strong	Decisive and commanding, suitable for instructional content

Multilingual

Both APIs support 20+ languages with natural pronunciation and culturally aware speech rhythms. The model detects the input language automatically, or you can specify a language code explicitly for consistent results.

Language	Language Code
English	`en`
Arabic (Egypt)	`ar-EG`
Arabic (Saudi Arabia)	`ar-SA`
Arabic (United Arab Emirates)	`ar-AE`
Bengali	`bn`
Chinese (Simplified)	`zh`
French	`fr`
German	`de`
Hindi	`hi`
Indonesian	`id`
Italian	`it`
Japanese	`ja`
Korean	`ko`
Portuguese (Brazil)	`pt-BR`
Portuguese (Portugal)	`pt-PT`
Russian	`ru`
Spanish (Mexico)	`es-MX`
Spanish (Spain)	`es-ES`
Turkish	`tr`
Vietnamese	`vi`

The model is also capable of generating speech in additional languages beyond those listed above, with varying degrees of accuracy.
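If you pass language codes explicitly, it can help to normalize them on your side first. The sketch below checks a code against the table above, treating `None` as "let the model detect the language". The helper name `normalize_language` is hypothetical, and rejecting unlisted codes is a deliberately strict policy chosen for this example; as noted above, the model may still handle other languages.

```python
from typing import Optional

# Codes copied from the language table on this page.
DOCUMENTED_LANGUAGE_CODES = (
    "en", "ar-EG", "ar-SA", "ar-AE", "bn", "zh", "fr", "de", "hi", "id",
    "it", "ja", "ko", "pt-BR", "pt-PT", "ru", "es-MX", "es-ES", "tr", "vi",
)

# Case-insensitive lookup that maps back to the documented spelling.
_CANONICAL = {code.lower(): code for code in DOCUMENTED_LANGUAGE_CODES}


def normalize_language(code: Optional[str]) -> Optional[str]:
    """Return the documented spelling of a code, or None for auto-detection.

    Raises ValueError for codes that are not in the documented table.
    """
    if code is None:
        return None  # leave language detection to the model
    canonical = _CANONICAL.get(code.strip().lower())
    if canonical is None:
        raise ValueError(f"language code not in the documented list: {code!r}")
    return canonical


language = normalize_language("pt-br")
```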

Audio Formats

The voice APIs support multiple audio formats and sample rates to match your application's requirements:

MP3 - Compressed audio, wide compatibility (TTS API)
WAV - Lossless audio for editing and post-production (TTS API)
PCM (Linear16) - High-quality audio with configurable sample rates (8 kHz to 48 kHz)
G.711 μ-law - Optimized for telephony applications
G.711 A-law - Standard for international telephony
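Raw Linear16 PCM has no header, so players and editors usually need it wrapped in a container. The sketch below uses Python's standard `wave` module to wrap PCM bytes in a WAV file and to compute their duration. It assumes mono, 16-bit little-endian samples; the channel count and byte order of the API's PCM output are assumptions here, and the helper names are hypothetical.

```python
import io
import wave

BYTES_PER_SAMPLE = 2  # Linear16 means 16-bit (2-byte) samples


def pcm_to_wav(pcm: bytes, sample_rate: int, channels: int = 1) -> bytes:
    """Wrap headerless Linear16 PCM in a WAV container (assumes mono by default)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(BYTES_PER_SAMPLE)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


def duration_seconds(pcm: bytes, sample_rate: int, channels: int = 1) -> float:
    """Playback length of a Linear16 PCM buffer."""
    return len(pcm) / (sample_rate * channels * BYTES_PER_SAMPLE)


silence = b"\x00\x00" * 24000  # one second of mono silence at 24 kHz
wav_bytes = pcm_to_wav(silence, sample_rate=24000)
```

The same arithmetic shows why G.711 suits telephony: at 8 kHz with one byte per sample it needs 8,000 bytes per second, half of what Linear16 needs at the same sample rate.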

Example Applications

Complete working examples are available demonstrating various voice integration patterns:

Web Voice Agent

Real-time voice chat in the browser with React frontend and Python/Node.js backends.

Architecture:

Browser (React) ←WebSocket→ Backend (FastAPI/Express) ←WebSocket→ xAI API

Features:

Real-time audio streaming
Visual transcript display
Debug console for development
Interchangeable backends
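The backend in this architecture is essentially a two-way relay: audio from the browser is forwarded to the xAI API, and responses are forwarded back. The sketch below shows that relay pattern with `asyncio.Queue` objects standing in for the two WebSocket connections so it runs without a network. It is not the example repository's code; the names `pump` and `relay` and the `None` end-of-stream sentinel are illustrative choices.

```python
import asyncio
from typing import List, Optional, Tuple


async def pump(source: asyncio.Queue, sink: asyncio.Queue) -> None:
    """Forward items from source to sink until a None sentinel arrives."""
    while True:
        item = await source.get()
        await sink.put(item)
        if item is None:  # propagate end-of-stream, then stop
            return


async def relay(browser_in: asyncio.Queue, xai_out: asyncio.Queue,
                xai_in: asyncio.Queue, browser_out: asyncio.Queue) -> None:
    """Run both directions concurrently, as a WebSocket bridge would."""
    await asyncio.gather(pump(browser_in, xai_out), pump(xai_in, browser_out))


def drain(queue: asyncio.Queue) -> List[Optional[bytes]]:
    """Collect everything currently waiting in a queue."""
    items = []
    while not queue.empty():
        items.append(queue.get_nowait())
    return items


async def demo() -> Tuple[list, list]:
    browser_in, xai_out, xai_in, browser_out = (asyncio.Queue() for _ in range(4))
    for chunk in (b"mic-1", b"mic-2", None):  # simulated microphone audio
        browser_in.put_nowait(chunk)
    for chunk in (b"reply-1", None):  # simulated audio coming back from the API
        xai_in.put_nowait(chunk)
    await relay(browser_in, xai_out, xai_in, browser_out)
    return drain(xai_out), drain(browser_out)


to_xai, to_browser = asyncio.run(demo())
```

In a real backend each queue would be replaced by reads and writes on an actual WebSocket, but keeping the two directions as independent tasks is what lets the agent speak while the user is still sending audio.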

Phone Voice Agent (Twilio)

AI-powered phone system using Twilio integration.

Architecture:

Phone Call ←SIP→ Twilio ←WebSocket→ Node.js Server ←WebSocket→ xAI API

Features:

Phone call integration
Real-time voice processing
Function/tool calling support
Production-ready architecture
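On the Twilio leg of this architecture, call audio typically reaches your server as JSON messages over a WebSocket. The sketch below extracts the audio from one such message. It assumes the Twilio Media Streams message shape as commonly documented (an `event` of `media` with a base64 `media.payload` carrying 8 kHz μ-law audio); verify this against Twilio's documentation. The example application uses Node.js; Python is used here only for illustration, and the helper names are hypothetical.

```python
import base64
import json
from typing import Optional

MULAW_SAMPLE_RATE = 8000  # G.711 μ-law: 8,000 one-byte samples per second


def extract_audio(message: str) -> Optional[bytes]:
    """Return raw μ-law bytes from a media message, or None for other events.

    ASSUMPTION: the message follows the Twilio Media Streams layout
    {"event": "media", "media": {"payload": "<base64>"}}.
    """
    event = json.loads(message)
    if event.get("event") != "media":
        return None
    return base64.b64decode(event["media"]["payload"])


def duration_ms(mulaw_audio: bytes) -> float:
    """Playback length of a μ-law chunk in milliseconds."""
    return len(mulaw_audio) * 1000 / MULAW_SAMPLE_RATE


# Simulated message: 160 bytes of 0xFF, which is silence in μ-law.
sample_message = json.dumps({
    "event": "media",
    "media": {"payload": base64.b64encode(b"\xff" * 160).decode("ascii")},
})
audio = extract_audio(sample_message)
```

Because G.711 μ-law is one of the supported audio formats listed above, a server like this may be able to pass telephony audio through without transcoding.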

WebRTC Voice Agent

The Grok Voice Agent API uses WebSocket connections. Direct WebRTC connections are not currently available.

Instead, run a WebRTC server between the client and the API: the browser connects to your server over WebRTC, and the server connects to the Grok Voice Agent API over WebSocket.

Architecture:

Browser (React) ←WebRTC→ Backend (Express) ←WebSocket→ xAI API

Features:

Real-time audio streaming
Visual transcript display
Debug console for development
WebRTC backend handles all WebSocket connections to xAI API

Third Party Integrations

Build voice agents using popular third-party frameworks and platforms that integrate with the Grok Voice Agent API.

LiveKit

Build real-time voice agents using LiveKit's open-source framework with native Grok Voice Agent API integration and WebRTC support.

Voximplant

Build real-time voice agents using Voximplant's open-source framework with native Grok Voice Agent API integration and SIP support.

Pipecat

Build real-time voice agents using Pipecat's open-source framework with native Grok Voice Agent API integration and advanced conversation management.
