Voice
Create client secret
/v1/realtime/client_secrets
Create an ephemeral client secret for authenticating browser-side Realtime API connections.
Request Body
Response Body
value
string
The ephemeral token value. Use it as a Bearer token in the WebSocket Authorization header, or in the Sec-WebSocket-Protocol header with prefix `xai-client-secret.`.
expires_at
integer
Unix timestamp (seconds) when this client secret expires.
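The sketch below, a minimal standard-library illustration, mints a client secret (assuming an empty JSON request body, since no request fields are documented above) and shows the two equivalent ways to present the resulting value on a browser-side WebSocket handshake:

```python
import json
import urllib.request

API_BASE = "https://api.x.ai"

def create_client_secret(api_key: str) -> dict:
    """POST /v1/realtime/client_secrets with a server-side API key.
    Sending an empty JSON object is an assumption: no request body
    fields are documented."""
    req = urllib.request.Request(
        f"{API_BASE}/v1/realtime/client_secrets",
        data=b"{}",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # {"value": ..., "expires_at": ...}

def browser_auth_headers(client_secret_value: str) -> dict:
    """The two equivalent ways to present an ephemeral secret on the
    WebSocket handshake; use exactly one of them, not both."""
    return {
        "Authorization": f"Bearer {client_secret_value}",
        "Sec-WebSocket-Protocol": f"xai-client-secret.{client_secret_value}",
    }
```

In a browser, the Sec-WebSocket-Protocol variant is usually the practical one, since the WebSocket API does not allow setting an Authorization header directly.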
Realtime
Real-time voice conversations with Grok models via WebSocket. The connection begins with an HTTP GET that is upgraded to WebSocket (status 101). Once connected, the client and server exchange JSON messages to configure the session, stream audio, and receive responses.
Handshake
URL
wss://api.x.ai/v1/realtime
Method
GET
Status
101 Switching Protocols
Headers
Authorization
string
required
Bearer token for authentication. Use your xAI API key (server-side only) or an ephemeral client secret from the Create client secret endpoint.
Bearer $XAI_API_KEY
Sec-WebSocket-Protocol
string
Alternative authentication for browser clients. Pass the ephemeral token with prefix `xai-client-secret.`. When provided, the Authorization header is not required.
xai-client-secret.<EPHEMERAL_TOKEN>
Query Parameters
model
string
optional
default: grok-voice-fast-1.0
Model to use for the session. For the best experience, use grok-voice-think-fast-1.0.
Allowed values: grok-voice-fast-1.0, grok-voice-think-fast-1.0
Example Message Flow
Client → Server
Server → Client
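The message schemas are not reproduced in this reference. As a hedged sketch of the first client message that configures the session: the session.update type and the voice field are named elsewhere in this reference (see Custom voices), while the nesting and any remaining keys are illustrative guesses.

```python
import json

def session_update(voice: str = "eve", instructions: str = "") -> str:
    """Build a hypothetical session-configuration message. The
    `session.update` type name and `voice` field appear in this
    reference; the `session` nesting and `instructions` key are
    illustrative assumptions."""
    session = {"voice": voice}
    if instructions:
        session["instructions"] = instructions
    return json.dumps({"type": "session.update", "session": session})
```

After sending a configuration message, the client streams audio and receives response messages on the same connection.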
Text to speech - REST
/v1/tts
Convert text into speech audio.
Request Body
text
string
required
The text to convert to speech. Maximum 15,000 characters. Supports inline speech tags for expressive output: [pause], [long-pause], [hum-tune], [laugh], [chuckle], [giggle], [cry], [tsk], [tongue-click], [lip-smack], [breath], [inhale], [exhale], [sigh]. Also supports wrapping tags for style control: <soft>, <whisper>, <loud>, <build-intensity>, <decrease-intensity>, <higher-pitch>, <lower-pitch>, <slow>, <fast>, <sing-song>, <singing>, <laugh-speak>, <emphasis>.
language
string
required
BCP-47 language code (e.g. en, zh, pt-BR) or auto for automatic language detection. Case-insensitive. Supported values: auto, en, ar-EG, ar-SA, ar-AE, bn, zh, fr, de, hi, id, it, ja, ko, pt-BR, pt-PT, ru, es-MX, es-ES, tr, vi. Additional languages may work with varying accuracy.
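A request body can be assembled as follows. text and language are the documented fields; voice_id is included because the Custom voices section states it is accepted by POST /v1/tts. This is a sketch of the payload, not an official client:

```python
import json

def tts_payload(text: str, language: str = "auto",
                voice_id: str = "eve") -> bytes:
    """Build a POST /v1/tts request body, enforcing the documented
    15,000-character limit on `text`."""
    if len(text) > 15_000:
        raise ValueError("text exceeds the 15,000 character limit")
    return json.dumps(
        {"text": text, "language": language, "voice_id": voice_id}
    ).encode()

# Inline speech tags and wrapping style tags go directly in the text:
body = tts_payload("<whisper>Don't tell anyone.</whisper> [sigh] Fine.",
                   language="en")
```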
Text to speech - Streaming
Bidirectional streaming text-to-speech via WebSocket. Send text incrementally and receive audio chunks in real time. Shares the /v1/tts path with the batch POST endpoint — a GET with Upgrade: websocket activates streaming mode. Configuration is done via query parameters at connection time. Supports multi-utterance: after audio.done, send another stream of text.delta messages on the same connection.
Handshake
URL
wss://api.x.ai/v1/tts
Method
GET
Status
101 Switching Protocols
Headers
Authorization
string
required
Bearer token for authentication. Use your xAI API key.
Bearer $XAI_API_KEY
Query Parameters
voice
string
optional
default: eve
Voice identifier. Use a built-in voice from GET /v1/tts/voices (e.g. eve, ara) or a custom voice ID.
language
string
required
BCP-47 language code (e.g. en, zh, pt-BR) or auto for automatic language detection. Case-insensitive.
Allowed values: auto, en, ar-EG, ar-SA, ar-AE, bn, zh, fr, de, hi, id, it, ja, ko, pt-BR, pt-PT, ru, es-MX, es-ES, tr, vi
codec
string
optional
default: mp3
Audio codec for the output.
Allowed values: mp3, wav, pcm, mulaw, alaw
sample_rate
integer
optional
default: 24000
Sample rate in Hz.
Allowed values: 8000, 16000, 22050, 24000, 44100, 48000
bit_rate
integer
optional
default: 128000
Bit rate in bps. Only applies when codec is mp3.
Allowed values: 32000, 64000, 96000, 128000, 192000
optimize_streaming_latency
integer
optional
default: 0
Latency optimization level. 0 (default): No optimization — best audio quality. 1: Reduced first-chunk size for lower time-to-first-audio, with minor quality tradeoff at chunk boundaries.
Allowed values: 0, 1
text_normalization
boolean
optional
default: false
Enable text normalization before synthesis. When enabled, the model normalizes written-form text (e.g. numbers, abbreviations, symbols) into spoken-form before generating audio.
Example Message Flow
Client → Server
Server → Client
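Since all configuration rides on the query string, the handshake URL and the incremental text messages can be built as below. The text.delta type name comes from the description above; the payload key carrying the text chunk is an assumption:

```python
import json
from urllib.parse import urlencode

def tts_ws_url(voice: str = "eve", language: str = "en",
               codec: str = "mp3", sample_rate: int = 24000) -> str:
    """Streaming TTS handshake URL; configuration happens only here."""
    params = {"voice": voice, "language": language,
              "codec": codec, "sample_rate": sample_rate}
    return "wss://api.x.ai/v1/tts?" + urlencode(params)

def text_delta(chunk: str) -> str:
    """Incremental text message. Only the `text.delta` type name is
    documented; the `text` key is an illustrative assumption."""
    return json.dumps({"type": "text.delta", "text": chunk})
```

For multi-utterance use, keep the connection open: once the server sends audio.done for one utterance, begin the next stream of text.delta messages on the same socket.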
Text to speech - List voices
/v1/tts/voices
List all available TTS voices.
Response Body
voices
array
List of available voices.
Text to speech - Get voice
/v1/tts/voices/{voice_id}
Get details for a specific voice.
Path parameters
voice_id
string
required
The unique identifier of the voice (e.g. `eve`, `ara`).
Response Body
voice_id
string
Unique identifier for the voice (lowercase). Pass this value as voice_id in TTS requests or as the voice parameter in Realtime API session configuration.
name
string
Human-readable display name for the voice.
Speech to text - REST
/v1/stt
Transcribe an audio file to text.
Request Body
Response Body
text
string
Full transcript text. For multichannel requests, this is a merged transcript across all channels (words interleaved by timestamp).
language
string
Detected language code (ISO 639-1, e.g. en). Currently empty — language detection is not yet enabled.
duration
number
Audio duration in seconds (rounded to 2 decimal places).
words
array
Word-level segments with timestamps. Omitted when empty.
channels
array
Per-channel transcripts. Only present when multichannel=true. Omitted for single-channel audio.
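The merge described above can be reproduced client-side when working with per-channel words. The field shapes used here (words, word, start) are assumed for illustration only:

```python
def merge_channels(channels):
    """Interleave each channel's word segments by start timestamp to
    rebuild the merged top-level transcript."""
    all_words = [w for ch in channels for w in ch["words"]]
    all_words.sort(key=lambda w: w["start"])
    return " ".join(w["word"] for w in all_words)

merged = merge_channels([
    {"words": [{"word": "hello", "start": 0.0},
               {"word": "there", "start": 1.2}]},
    {"words": [{"word": "hi", "start": 0.6}]},
])
# merged == "hello hi there"
```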
Speech to text - Streaming
Real-time streaming speech-to-text via WebSocket. Stream raw audio as binary frames and receive JSON transcript events as the audio is processed. Configuration is done via query parameters at connection time. Each connection handles a single utterance — reconnect to transcribe another.
Handshake
URL
wss://api.x.ai/v1/stt
Method
GET
Status
101 Switching Protocols
Headers
Authorization
string
required
Bearer token authentication. Format: Bearer <your xAI API key>.
Bearer $XAI_API_KEY
Query Parameters
sample_rate
integer
optional
default: 16000
Audio sample rate in Hz. Supported values: 8000, 16000, 22050, 24000, 44100, 48000.
encoding
string
optional
default: pcm
Audio encoding format. pcm — signed 16-bit little-endian (2 bytes/sample). mulaw — G.711 µ-law (1 byte/sample). alaw — G.711 A-law (1 byte/sample).
interim_results
boolean
optional
default: false
When true, the server emits partial transcript events (is_final=false) approximately every 500 ms while audio is being processed. When false (default), only finalized results are sent.
endpointing
integer
optional
default: 10
Silence duration in milliseconds before the server fires a speech_final=true event, indicating the speaker stopped talking. Range: 0–5000. Set to 0 to fire on any VAD silence boundary with no added delay.
language
string
optional
default: (none)
Language code (e.g. en, fr, de, ja). When set, enables Inverse Text Normalization — spoken-form numbers, currencies, and units are converted to their written form.
multichannel
boolean
optional
default: false
When true, enables per-channel transcription for interleaved multichannel audio. Requires channels to be set to ≥ 2.
channels
integer
optional
default: 1
Number of interleaved audio channels. Required when multichannel=true. Min: 2, Max: 8.
diarize
boolean
optional
default: false
When true, enables speaker diarization. Words in transcript.partial and transcript.done events include a speaker field (integer) identifying the detected speaker.
filler_words
boolean
optional
default: false
When true, filler words (e.g. uh, um, er) are included in the transcript. When false (default), filler words are automatically removed from the transcript text and the words array.
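Putting the parameters together, a handshake URL might be assembled like this — a sketch that also enforces the documented multichannel/channels constraint:

```python
from urllib.parse import urlencode

def stt_ws_url(sample_rate: int = 16000, encoding: str = "pcm",
               interim_results: bool = False,
               multichannel: bool = False, channels: int = 1) -> str:
    """Assemble the streaming STT handshake URL from the query
    parameters documented above."""
    if multichannel and not 2 <= channels <= 8:
        raise ValueError("multichannel=true requires 2 <= channels <= 8")
    params = {
        "sample_rate": sample_rate,
        "encoding": encoding,
        "interim_results": str(interim_results).lower(),
        "multichannel": str(multichannel).lower(),
        "channels": channels,
    }
    return "wss://api.x.ai/v1/stt?" + urlencode(params)
```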
Example Message Flow
Client → Server
Server → Client
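Audio is sent as raw binary frames, so chunk sizes follow directly from the sample rate and the bytes-per-sample of each encoding listed above. A sketch — the 100 ms frame duration is an arbitrary choice, not a protocol requirement:

```python
# Bytes per sample for each documented encoding:
# pcm = signed 16-bit LE (2 bytes), mulaw/alaw = G.711 (1 byte).
BYTES_PER_SAMPLE = {"pcm": 2, "mulaw": 1, "alaw": 1}

def frame_size(sample_rate: int, encoding: str,
               frame_ms: int = 100) -> int:
    """Bytes per binary WebSocket frame for a given frame duration."""
    return sample_rate * BYTES_PER_SAMPLE[encoding] * frame_ms // 1000

def frames(audio: bytes, sample_rate: int = 16000,
           encoding: str = "pcm"):
    """Split raw audio into fixed-size chunks to send as binary frames."""
    size = frame_size(sample_rate, encoding)
    return [audio[i:i + size] for i in range(0, len(audio), size)]
```

At the default 16 kHz PCM, each 100 ms frame is 3,200 bytes. The server replies with JSON transcript events (transcript.partial when interim_results is enabled, transcript.done for finalized results).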
Custom voices - Create
/v1/custom-voices
This endpoint is gated to teams on an Enterprise plan — contact our team to enable access. You can also create up to 30 custom voices for free in the console. Custom Voices is currently only available in the United States, with the exception of Illinois.
Create a custom voice from a reference audio clip.
Request Body
file
string
required
Reference audio file. Maximum duration: 120 seconds. Supported formats: WAV, MP3, FLAC, OGG, Opus, M4A, AAC, MKV, MP4 (anything ffmpeg can decode).
Response Body
voice_id
string
8-character lowercase alphanumeric voice identifier. Use this as voice_id in POST /v1/tts, as the voice query parameter on the streaming TTS WebSocket, or as voice in a Voice Agent session.update message.
created_at
string
RFC 3339 timestamp.
Custom voices - List
/v1/custom-voices
List custom voices owned by your team.
Query parameters
limit
integer
Maximum number of voices to return per page. Range: 1-1000. Default: 100.
pagination_token
string
Token from a previous response's `pagination_token` field. Pass to fetch the next page.
Response Body
voices
array
List of custom voices owned by the calling team.
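The two query parameters combine into the usual drain-all-pages loop. fetch_page below stands in for an HTTP GET against /v1/custom-voices and must return the response shape documented here (voices, plus pagination_token when more pages remain):

```python
def list_all_voices(fetch_page):
    """Drain a paginated listing. `fetch_page(pagination_token)` is an
    injected stand-in for GET /v1/custom-voices; pass None for the
    first page."""
    voices, token = [], None
    while True:
        page = fetch_page(token)
        voices.extend(page["voices"])
        token = page.get("pagination_token")
        if not token:
            return voices
```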
Custom voices - Get
/v1/custom-voices/{voice_id}
Get a single custom voice.
Path parameters
voice_id
string
required
The 8-character lowercase alphanumeric custom voice ID returned by `POST /v1/custom-voices`.
Response Body
voice_id
string
8-character lowercase alphanumeric voice identifier. Use this as voice_id in POST /v1/tts, as the voice query parameter on the streaming TTS WebSocket, or as voice in a Voice Agent session.update message.
created_at
string
RFC 3339 timestamp.
Custom voices - Update
/v1/custom-voices/{voice_id}
Update custom voice metadata.
Path parameters
voice_id
string
required
Request Body
Response Body
voice_id
string
8-character lowercase alphanumeric voice identifier. Use this as voice_id in POST /v1/tts, as the voice query parameter on the streaming TTS WebSocket, or as voice in a Voice Agent session.update message.
created_at
string
RFC 3339 timestamp.
Custom voices - Delete
/v1/custom-voices/{voice_id}
Delete a custom voice.
Path parameters
voice_id
string
required
Response Body
deleted
boolean
Always true on success.
Custom voices - Get audio
/v1/custom-voices/{voice_id}/audio
Download the reference audio for a custom voice.
Path parameters
voice_id
string
required
Last updated: April 26, 2026