Voice Overview

xAI offers two voice APIs: the Text to Speech API for converting text into spoken audio, and the Voice Agent API for real-time, interactive voice conversations with Grok.


Text to Speech API

Convert text into spoken audio with a single API call. Choose from multiple voices, add inline speech tags for expressive delivery, and output in formats ranging from high-fidelity MP3 to telephony-optimized μ-law.

Endpoint: POST https://api.x.ai/v1/tts
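
As a rough illustration of a single-call request, the sketch below builds and sends a POST to the endpoint above using only the Python standard library. The JSON field names (`text`, `voice`), the Bearer authorization scheme, the `XAI_API_KEY` environment variable, and the assumption that the response body is the raw audio are illustrative guesses, not taken from this page; consult the Text to Speech API reference for the actual request schema.

```python
import json
import os
import urllib.request

TTS_URL = "https://api.x.ai/v1/tts"  # endpoint stated on this page


def build_tts_request(text: str, api_key: str, voice: str = "Eve") -> urllib.request.Request:
    # ASSUMPTION: the "text" and "voice" field names and the Bearer scheme
    # are placeholders; the real request schema may differ.
    body = json.dumps({"text": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        TTS_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def synthesize(text: str, out_path: str) -> None:
    # ASSUMPTION: the response body is the encoded audio itself.
    request = build_tts_request(text, os.environ["XAI_API_KEY"])
    with urllib.request.urlopen(request) as response:
        with open(out_path, "wb") as audio_file:
            audio_file.write(response.read())
```

Separating `build_tts_request` from `synthesize` keeps the request construction easy to inspect and test without making a network call.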

Use Cases:

  • Narration for audiobooks, articles, and documentation
  • Dynamic audio content for apps and games
  • Voiceovers for videos and presentations
  • Accessibility - read content aloud for users with visual impairments


Voice Agent API

Build real-time voice applications with the Grok Voice Agent API. It provides interactive voice conversations with Grok models over WebSocket, suited to voice assistants, phone agents, and similar applications.

WebSocket Endpoint: wss://api.x.ai/v1/realtime

The Voice Agent API is only available in the us-east-1 region.
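
A minimal connection sketch for the WebSocket endpoint above, using the third-party `websockets` package. The Bearer-token handshake header, the `XAI_API_KEY` environment variable, and the expectation that the server sends an initial event are assumptions, not details confirmed by this page.

```python
import asyncio
import os
from typing import Union

REALTIME_URL = "wss://api.x.ai/v1/realtime"  # endpoint stated on this page


def auth_headers(api_key: str) -> dict[str, str]:
    # ASSUMPTION: Bearer-token auth on the WebSocket handshake.
    return {"Authorization": f"Bearer {api_key}"}


async def first_event(timeout_s: float = 5.0) -> Union[str, bytes]:
    # Third-party dependency, imported lazily: pip install websockets
    import websockets

    # `additional_headers` is the keyword in websockets >= 14;
    # the legacy client used `extra_headers` instead.
    async with websockets.connect(
        REALTIME_URL,
        additional_headers=auth_headers(os.environ["XAI_API_KEY"]),
    ) as ws:
        # ASSUMPTION: the server sends an event after connecting.
        # If it does not, this raises a timeout error instead of hanging.
        return await asyncio.wait_for(ws.recv(), timeout=timeout_s)
```

Calling `asyncio.run(first_event())` opens the connection, waits for one message, and closes the socket.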

Use Cases:

  • Voice assistants for web and mobile
  • AI-powered phone systems with Twilio
  • Real-time customer support
  • Interactive Voice Response (IVR) systems


Enterprise Ready

Optimized for enterprise use cases across customer support, medical, legal, finance, insurance, and more.

  • Telephony - Connect to platforms like Twilio, Vonage, and other SIP providers
  • Tool Calling - CRMs, calendars, ticketing systems, databases, and custom APIs
  • Multilingual - Serve global customers in their native language with natural accents
  • Low Latency - Real-time responses for natural, human-like conversations
  • Accuracy - Precise transcription and understanding of critical information:
    • Industry-specific terminology including medical, legal, and financial vocabulary
    • Email addresses, dates, and alphanumeric codes
    • Names, addresses, and phone numbers

Tool Calling

Extend your voice agent's capabilities with powerful built-in tools that execute during conversations:

  • Web Search - Real-time internet search for current information, news, and facts
  • X Search - Search posts, trends, and discussions from X
  • Collections - RAG-powered search over your uploaded documents and knowledge bases
  • Custom Functions - Define your own tools with JSON schemas for booking, lookups, calculations, and more

Tools are called automatically based on conversation context. Your voice agent can search the web, query your documents, and execute custom business logic - all while maintaining a natural conversation flow.
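
To make the custom-function idea concrete, here is a sketch of one tool described with a JSON Schema and a local dispatcher that runs it when the model requests a call. The envelope shape (`type`, `name`, `description`, `parameters`) is a common function-tool convention assumed for illustration; the exact format the Voice Agent API expects may differ. `book_table` and its fields are hypothetical.

```python
import json
from typing import Any, Callable

# ASSUMPTION: a generic function-tool shape; not confirmed by this page.
BOOK_TABLE_TOOL = {
    "type": "function",
    "name": "book_table",
    "description": "Book a restaurant table.",
    "parameters": {
        "type": "object",
        "properties": {
            "party_size": {"type": "integer"},
            "time": {"type": "string", "description": "ISO 8601 start time"},
        },
        "required": ["party_size", "time"],
    },
}


def book_table(party_size: int, time: str) -> dict[str, Any]:
    # Hypothetical business logic; replace with a real booking call.
    return {"status": "confirmed", "party_size": party_size, "time": time}


TOOLS = {BOOK_TABLE_TOOL["name"]: BOOK_TABLE_TOOL}
HANDLERS: dict[str, Callable[..., dict[str, Any]]] = {"book_table": book_table}


def dispatch(name: str, arguments_json: str) -> str:
    """Run the named handler and return its result as a JSON string."""
    if name not in HANDLERS:
        return json.dumps({"error": f"unknown tool: {name}"})
    args = json.loads(arguments_json)
    required = TOOLS[name]["parameters"]["required"]
    missing = [key for key in required if key not in args]
    if missing:
        return json.dumps({"error": f"missing arguments: {missing}"})
    return json.dumps(HANDLERS[name](**args))
```

Returning errors as JSON rather than raising lets the agent explain the problem to the caller and continue the conversation.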


Shared Capabilities

The following capabilities apply to both the Text to Speech API and the Voice Agent API.

Voices

Choose from 5 distinct voices, each with unique characteristics suited to different applications:

| Voice | Type    | Tone                  | Description                                                 |
|-------|---------|-----------------------|-------------------------------------------------------------|
| Eve   | Female  | Energetic, upbeat     | Default voice, engaging and enthusiastic                    |
| Ara   | Female  | Warm, friendly        | Balanced and conversational                                 |
| Rex   | Male    | Confident, clear      | Professional and articulate, ideal for business             |
| Sal   | Neutral | Smooth, balanced      | Versatile voice suitable for various contexts               |
| Leo   | Male    | Authoritative, strong | Decisive and commanding, suitable for instructional content |

Multilingual

Both APIs support 20+ languages with natural pronunciation and culturally aware speech rhythms. The model automatically detects the input language, or you can specify a language code explicitly for consistent results.

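| Language                      | Language Code |
|-------------------------------|---------------|
| English                       | en            |
| Arabic (Egypt)                | ar-EG         |
| Arabic (Saudi Arabia)         | ar-SA         |
| Arabic (United Arab Emirates) | ar-AE         |
| Bengali                       | bn            |
| Chinese (Simplified)          | zh            |
| French                        | fr            |
| German                        | de            |
| Hindi                         | hi            |
| Indonesian                    | id            |
| Italian                       | it            |
| Japanese                      | ja            |
| Korean                        | ko            |
| Portuguese (Brazil)           | pt-BR         |
| Portuguese (Portugal)         | pt-PT         |
| Russian                       | ru            |
| Spanish (Mexico)              | es-MX         |
| Spanish (Spain)               | es-ES         |
| Turkish                       | tr            |
| Vietnamese                    | vi            |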

The model is also capable of generating speech in additional languages beyond those listed above, with varying degrees of accuracy.
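
When an application passes a language code explicitly, it may first need to map a user locale onto one of the listed codes. The helper below is one possible sketch: the code set is copied from the table above, while the function name and the fallback rule (return `None` and rely on automatic detection) are illustrative choices rather than API behavior.

```python
from typing import Optional

# Language codes copied from the table above.
SUPPORTED = {
    "en", "ar-EG", "ar-SA", "ar-AE", "bn", "zh", "fr", "de", "hi", "id",
    "it", "ja", "ko", "pt-BR", "pt-PT", "ru", "es-MX", "es-ES", "tr", "vi",
}
_BY_LOWER = {code.lower(): code for code in SUPPORTED}


def resolve_language(locale: str) -> Optional[str]:
    """Map a user locale such as 'fr_CA' or 'pt-br' to a listed code.

    Returns None when there is no unambiguous match, so the caller can
    omit the code and let the model detect the language.
    """
    normalized = locale.replace("_", "-").lower()
    if normalized in _BY_LOWER:
        return _BY_LOWER[normalized]
    # Fall back to the bare language only if it is listed without a region.
    base = normalized.split("-")[0]
    return _BY_LOWER.get(base)
```

For example, `resolve_language("fr_CA")` falls back to `fr`, while `resolve_language("es")` returns `None` because only the regional variants `es-MX` and `es-ES` are listed.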

Audio Formats

Multiple audio formats and sample rates are supported to match your application's requirements:

  • MP3 - Compressed audio, wide compatibility (TTS API)
  • WAV - Lossless audio for editing and post-production (TTS API)
  • PCM (Linear16) - High-quality audio with configurable sample rates (8kHz-48kHz)
  • G.711 μ-law - Optimized for telephony applications
  • G.711 A-law - Standard for international telephony
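
The practical difference between these formats is largely one of size and container. The sketch below computes raw data rates for Linear16 and G.711, and wraps raw PCM in a WAV container using the standard library. It assumes mono, little-endian 16-bit samples, which this page does not specify.

```python
import io
import wave


def pcm16_bytes_per_second(sample_rate_hz: int, channels: int = 1) -> int:
    # Linear16 stores each sample as 2 bytes.
    return sample_rate_hz * channels * 2


def g711_bytes_per_second(sample_rate_hz: int = 8000) -> int:
    # G.711 (mu-law or A-law) stores each sample as 1 byte, normally at 8 kHz.
    return sample_rate_hz


def pcm16_to_wav(pcm: bytes, sample_rate_hz: int, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM in a WAV container (ASSUMPTION: little-endian)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate_hz)
        wav.writeframes(pcm)
    return buffer.getvalue()
```

At 8 kHz mono, Linear16 needs 16,000 bytes per second while G.711 needs 8,000, which is one reason the G.711 variants remain common on telephony links.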

Example Applications

Complete working examples demonstrate several voice integration patterns:

Web Voice Agent

Real-time voice chat in the browser with React frontend and Python/Node.js backends.

Architecture:

Browser (React) ←WebSocket→ Backend (FastAPI/Express) ←WebSocket→ xAI API

Features:

  • Real-time audio streaming
  • Visual transcript display
  • Debug console for development
  • Interchangeable backends

Phone Voice Agent (Twilio)

AI-powered phone system using Twilio integration.

Architecture:

Phone Call ←SIP→ Twilio ←WebSocket→ Node.js Server ←WebSocket→ xAI API

Features:

  • Phone call integration
  • Real-time voice processing
  • Function/tool calling support
  • Production-ready architecture
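
On the Twilio leg of this pipeline, audio travels as base64-encoded payloads inside JSON messages. The sketch below is based on Twilio's Media Streams message format rather than on the example repository, and is written in Python although the example server uses Node.js. It shows only the Twilio side; how the decoded audio is forwarded to the xAI API is not covered here.

```python
import base64
import json
from typing import Optional


def audio_from_twilio(message: str) -> Optional[bytes]:
    """Return raw audio bytes from a Twilio 'media' message, else None."""
    event = json.loads(message)
    if event.get("event") != "media":
        return None  # other events include "connected", "start", and "stop"
    return base64.b64decode(event["media"]["payload"])


def twilio_media_message(stream_sid: str, audio: bytes) -> str:
    """Build a 'media' message that plays audio back to the caller."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(audio).decode("ascii")},
    })
```

Twilio Media Streams carry 8 kHz μ-law audio, so selecting the G.711 μ-law format on the xAI side should avoid transcoding in the server.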

WebRTC Voice Agent

The Grok Voice Agent API uses WebSocket connections. Direct WebRTC connections are not currently available.

Instead, the client can connect over WebRTC to your own server, which then connects to the Grok Voice Agent API over WebSocket.

Architecture:

Browser (React) ←WebRTC→ Backend (Express) ←WebSocket→ xAI API

Features:

  • Real-time audio streaming
  • Visual transcript display
  • Debug console for development
  • WebRTC backend handles all WebSocket connections to xAI API


Third Party Integrations

Build voice agents using popular third-party frameworks and platforms that integrate with the Grok Voice Agent API.

LiveKit

Build real-time voice agents using LiveKit's open-source framework with native Grok Voice Agent API integration and WebRTC support.

Voximplant

Build real-time voice agents using Voximplant's open-source framework with native Grok Voice Agent API integration and SIP support.

Pipecat

Build real-time voice agents using Pipecat's open-source framework with native Grok Voice Agent API integration and advanced conversation management.
