Voice

Create client secret

/v1/realtime/client_secrets

Create an ephemeral client secret for authenticating browser-side Realtime API connections.

Request Body

Response Body

value

string

The ephemeral token value. Use as a Bearer token in the WebSocket Authorization header, or in the sec-websocket-protocol header with prefix xai-client-secret..

expires_at

integer

Unix timestamp (seconds) when this client secret expires.

Real-time voice conversations with Grok models via WebSocket. The connection begins with an HTTP GET that is upgraded to WebSocket (status 101). Once connected, the client and server exchange JSON messages to configure the session, stream audio, and receive responses.

Handshake

URL

wss://api.x.ai/v1/realtime

Method

GET

Status

101 Switching Protocols

Headers

Authorization

string

required

Bearer token for authentication. Use your xAI API key (server-side only) or an ephemeral client secret from the Create client secret endpoint.

Bearer $XAI_API_KEY

Sec-WebSocket-Protocol

string

Alternative authentication for browser clients. Pass the ephemeral token with prefix xai-client-secret.. When provided, the Authorization header is not required.

xai-client-secret.<EPHEMERAL_TOKEN>

Query Parameters

model

string

optional

default: grok-voice-fast-1.0

Model to use for the session. For the best experience, use grok-voice-think-fast-1.0

grok-voice-fast-1.0grok-voice-think-fast-1.0

Handshake

URLwss://api.x.ai/v1/realtime

MethodGET

Status101 Switching Protocols

Example Message Flow

↓session.created

↓conversation.created

↑session.update

↓session.updated

↑conversation.item.create

↓conversation.item.added

↑response.create

↓response.created

↓response.output_item.added

↓response.content_part.added

↓response.output_audio.delta

↓response.output_audio_transcript.delta

↓response.output_audio.done

↓response.output_audio_transcript.done

↓response.content_part.done

↓response.output_item.done

↓response.done

↑

Client → Server

↑

session.update

Update session configuration such as system prompt, voice, audio format, turn detection, and tools.

↑

input_audio_buffer.append

Append chunks of base64-encoded audio data to the input buffer. The server does not send back a corresponding message.

↑

input_audio_buffer.commit

Commit the audio buffer as a user message. Only available when turn_detection type is null. Confirmed by input_audio_buffer.committed from the server.

↑

conversation.item.create

Create a new conversation item. Can be a user text message, an assistant text message for history seeding, a function call for seeding tool-use history, or a function call output.

↑

input_audio_buffer.clear

Clear the input audio buffer. Use this to discard any pending audio data without committing it.

↑

conversation.item.delete

Delete a conversation item by ID. The server confirms deletion with a conversation.item.deleted event.

↑

response.create

Request the server to create a new assistant response. This is handled automatically when using server-side VAD.

↑

response.cancel

Cancel an in-progress response. In VAD mode, interruptions are automatic — use this for manual cancel in non-VAD mode.

↓

Server → Client

↓

session.created

Sent automatically on WebSocket connection. Contains the session configuration.

↓

conversation.created

The first message on connection. Notifies the client that a conversation session has been created.

↓

session.updated

Acknowledges the client's session.update message that the session has been configured.

↓

input_audio_buffer.speech_started

Notifies that the server's VAD detected the start of speech. Only available with server_vad turn detection.

↓

input_audio_buffer.speech_stopped

Notifies that the server's VAD detected the end of speech. Only available with server_vad turn detection.

↓

input_audio_buffer.committed

Input audio buffer has been committed as a user message.

↓

input_audio_buffer.cleared

Confirms the input audio buffer has been cleared.

↓

conversation.item.deleted

Confirms a conversation item has been deleted.

↓

conversation.item.added

A new user or assistant message has been added to the conversation history.

↓

conversation.item.input_audio_transcription.completed

Audio transcription for the user's input has been completed.

↓

response.created

A new assistant response turn is in progress. Audio deltas from this turn share the same response_id.

↓

response.output_item.added

A new assistant response item is added to the message history.

↓

response.output_item.done

An output item is complete.

↓

response.content_part.added

A content part starts within an output item.

↓

response.content_part.done

A content part finishes.

↓

response.output_audio_transcript.delta

Streaming text transcript delta of the assistant's audio response.

↓

response.output_audio_transcript.done

The audio transcript for this assistant turn has finished generating.

↓

response.output_audio.delta

Streaming base64-encoded audio delta of the assistant's response.

↓

response.output_audio.done

Audio generation for this assistant turn has finished.

↓

response.text.delta

Text-mode output delta (when using text modality).

↓

response.function_call_arguments.delta

Streaming function call arguments.

↓

response.function_call_arguments.done

A function call has been triggered with complete arguments. Your code should execute the function and return results via conversation.item.create with type function_call_output.

↓

mcp_list_tools.in_progress

MCP tool discovery has started.

↓

mcp_list_tools.completed

MCP tool discovery succeeded.

↓

mcp_list_tools.failed

MCP tool discovery failed.

↓

response.mcp_call_arguments.delta

MCP call arguments streaming.

↓

response.mcp_call_arguments.done

MCP call arguments finalized.

↓

response.mcp_call.in_progress

MCP server HTTP call starting.

↓

response.mcp_call.completed

MCP tool execution succeeded.

↓

response.mcp_call.failed

MCP tool execution failed.

↓

response.done

The assistant's response is completed. Sent after all audio and transcript deltas. Ready for the client to add a new conversation item.

↓

error

Sent when an error occurs. Contains error code and message. Most errors are recoverable and the session stays open.

Text to speech - REST

/v1/tts

Convert text into speech audio.

Request Body

text

string

required

The text to convert to speech. Maximum 15,000 characters. Supports inline speech tags for expressive output: [pause], [long-pause], [hum-tune], [laugh], [chuckle], [giggle], [cry], [tsk], [tongue-click], [lip-smack], [breath], [inhale], [exhale], [sigh]. Also supports wrapping tags for style control: <soft>, <whisper>, <loud>, <build-intensity>, <decrease-intensity>, <higher-pitch>, <lower-pitch>, <slow>, <fast>, <sing-song>, <singing>, <laugh-speak>, <emphasis>.

language

string

required

BCP-47 language code (e.g. en, zh, pt-BR) or auto for automatic language detection. Case-insensitive. Supported values: auto, en, ar-EG, ar-SA, ar-AE, bn, zh, fr, de, hi, id, it, ja, ko, pt-BR, pt-PT, ru, es-MX, es-ES, tr, vi. Additional languages may work with varying accuracy.

Text to speech - Streaming

WSS

Bidirectional streaming text-to-speech via WebSocket. Send text incrementally and receive audio chunks in real time. Shares the /v1/tts path with the batch POST endpoint — a GET with Upgrade: websocket activates streaming mode. Configuration is done via query parameters at connection time. Supports multi-utterance: after audio.done, send another stream of text.delta messages on the same connection.

Handshake

URL

wss://api.x.ai/v1/tts

Method

GET

Status

101 Switching Protocols

Headers

Authorization

string

required

Bearer token for authentication. Use your xAI API key.

Bearer $XAI_API_KEY

Query Parameters

voice

string

optional

default: eve

Voice identifier. Use a built-in voice from GET /v1/tts/voices (e.g. eve, ara) or a custom voice ID.

language

string

required

BCP-47 language code (e.g. en, zh, pt-BR) or auto for automatic language detection. Case-insensitive.

autoenar-EGar-SAar-AEbnzhfrdehiiditjakopt-BRpt-PTrues-MXes-EStrvi

codec

string

optional

default: mp3

Audio codec for the output.

mp3wavpcmmulawalaw

sample_rate

integer

optional

default: 24000

Sample rate in Hz.

80001600022050240004410048000

bit_rate

integer

optional

default: 128000

Bit rate in bps. Only applies when codec is mp3.

320006400096000128000192000

optimize_streaming_latency

integer

optional

default: 0

Latency optimization level. 0 (default): No optimization — best audio quality. 1: Reduced first-chunk size for lower time-to-first-audio, with minor quality tradeoff at chunk boundaries.

01

text_normalization

boolean

optional

default: false

Enable text normalization before synthesis. When enabled, the model normalizes written-form text (e.g. numbers, abbreviations, symbols) into spoken-form before generating audio.

Handshake

URLwss://api.x.ai/v1/tts

MethodGET

Status101 Switching Protocols

Example Message Flow

↑

Client → Server

↑

text.delta

Send a chunk of text to be synthesized. Text is processed incrementally — audio generation begins as soon as enough text is buffered. Individual deltas are capped at 15,000 characters.

↑

text.done

Signal that all text for this utterance has been sent. The server will finish generating audio and send audio.done. After receiving audio.done, you can start a new utterance with another text.delta.

↓

Server → Client

↓

audio.delta

A chunk of base64-encoded audio data. Decode and append to your audio buffer or pipe directly to playback. The format matches the codec and sample_rate specified in the query parameters.

↓

audio.done

Audio generation for this utterance is complete. The connection remains open for multi-utterance — send another text.delta to start a new synthesis, or close the connection.

↓

error

An error occurred during synthesis. The connection may be closed after this message.

Text to speech - List voices

/v1/tts/voices

List all available TTS voices.

Response Body

voices

array

List of available voices.

Text to speech - Get voice

/v1/tts/voices/{voice_id}

Get details for a specific voice.

Path parameters

voice_id

string

required

The unique identifier of the voice (e.g. `eve`, `ara`).

Response Body

voice_id

string

Unique identifier for the voice (lowercase). Pass this value as voice_id in TTS requests or as the voice parameter in Realtime API session configuration.

name

string

Human-readable display name for the voice.

Speech to text - REST

/v1/stt

Transcribe an audio file to text.

Request Body

Response Body

text

string

Full transcript text. For multichannel requests, this is a merged transcript across all channels (words interleaved by timestamp).

language

string

Detected language code (ISO 639-1, e.g. en). Currently empty — language detection is not yet enabled.

duration

number

Audio duration in seconds (rounded to 2 decimal places).

words

array

Word-level segments with timestamps. Omitted when empty.

channels

array

Per-channel transcripts. Only present when multichannel=true. Omitted for single-channel audio.

Speech to text - Streaming

WSS

Real-time streaming speech-to-text via WebSocket. Stream raw audio as binary frames and receive JSON transcript events as the audio is processed. Configuration is done via query parameters at connection time. Each connection handles a single utterance — reconnect to transcribe another.

Handshake

URL

wss://api.x.ai/v1/stt

Method

GET

Status

101 Switching Protocols

Headers

Authorization

string

required

Bearer token authentication. Format: Bearer <your xAI API key>.

Bearer $XAI_API_KEY

Query Parameters

sample_rate

integer

optional

default: 16000

Audio sample rate in Hz. Supported values: 8000, 16000, 22050, 24000, 44100, 48000.

encoding

string

optional

default: pcm

Audio encoding format. pcm — signed 16-bit little-endian (2 bytes/sample). mulaw — G.711 µ-law (1 byte/sample). alaw — G.711 A-law (1 byte/sample).

interim_results

boolean

optional

default: false

When true, the server emits partial transcript events (is_final=false) approximately every 500 ms while audio is being processed. When false (default), only finalized results are sent.

endpointing

integer

optional

default: 10

Silence duration in milliseconds before the server fires a speech_final=true event, indicating the speaker stopped talking. Range: 0–5000. Set to 0 for no delay (fire on any VAD silence boundary). Default: 10ms.

language

string

optional

default:

Language code (e.g. en, fr, de, ja). When set, enables Inverse Text Normalization — spoken-form numbers, currencies, and units are converted to their written form.

multichannel

boolean

optional

default: false

When true, enables per-channel transcription for interleaved multichannel audio. Requires channels to be set to ≥ 2.

channels

integer

optional

default: 1

Number of interleaved audio channels. Required when multichannel=true. Min: 2, Max: 8.

diarize

boolean

optional

default: false

When true, enables speaker diarization. Words in transcript.partial and transcript.done events include a speaker field (integer) identifying the detected speaker.

filler_words

boolean

optional

default: false

When true, filler words (e.g. uh, um, er) are included in the transcript. When false (default), filler words are automatically removed from the transcript text and the words array.

Handshake

URLwss://api.x.ai/v1/stt

MethodGET

Status101 Switching Protocols

Example Message Flow

↓transcript.created

↑Binary frame (audio)

↓transcript.partial

↑Binary frame (audio)

↓transcript.partial

↑Binary frame (audio)

↓transcript.partial

↑audio.done

↓transcript.done

↑

Client → Server

↑

Binary frame (audio)

Send raw audio as binary WebSocket frames in the encoding specified by the encoding query parameter. Audio should be streamed in real-time-paced chunks (e.g. 100 ms at a time). No base64 encoding — send raw bytes directly.

↑

audio.done

Signal that all audio has been sent. The server flushes any remaining buffered audio, emits final transcript events, and sends a transcript.done event. The connection closes after transcript.done.

↓

Server → Client

↓

transcript.created

Sent immediately after the WebSocket connection is established and the server is ready to receive audio. **Wait for this event before sending audio** — the server needs to initialize its ASR backend.

↓

transcript.partial

A transcript result for a portion of the audio stream. Two boolean fields convey state: interim (is_final=false) means text may still change, chunk final (is_final=true, speech_final=false) means the chunk is locked, and utterance final (is_final=true, speech_final=true) means the speaker stopped talking.

↓

transcript.done

Final transcript after audio.done. duration always present. One per channel when multichannel=true. Connection closes after this event.

↓

error

An error occurred during the session. Most errors (pipeline failures, stream timeouts) close the connection. Only client message parse errors keep the connection open.

Custom voices - Create

/v1/custom-voices

This endpoint is gated to teams on an Enterprise plan — contact our team to enable access. You can also create up to 30 custom voices for free in the console. Custom Voices is currently only available in the United States, with the exception of Illinois.

Create a custom voice from a reference audio clip.

Request Body

file

string

required

Reference audio file. Maximum duration: 120 seconds. Supported formats: WAV, MP3, FLAC, OGG, Opus, M4A, AAC, MKV, MP4 (anything ffmpeg can decode).

Response Body

voice_id

string

8-character lowercase alphanumeric voice identifier. Use this as voice_id in POST /v1/tts, as the voice query parameter on the streaming TTS WebSocket, or as voice in a Voice Agent session.update message.

created_at

string

RFC 3339 timestamp.

Custom voices - List

/v1/custom-voices

List custom voices owned by your team.

Query parameters

limit

integer

Maximum number of voices to return per page. Range: 1-1000. Default: 100.

pagination_token

string

Token from a previous response's `pagination_token` field. Pass to fetch the next page.

Response Body

voices

array

List of custom voices owned by the calling team.

Custom voices - Get

/v1/custom-voices/{voice_id}

Get a single custom voice.

Path parameters

voice_id

string

required

The 8-character lowercase alphanumeric custom voice ID returned by `POST /v1/custom-voices`.

Response Body

voice_id

string

created_at

string

RFC 3339 timestamp.

Custom voices - Update

/v1/custom-voices/{voice_id}

Update custom voice metadata.

Path parameters

voice_id

string

required

Request Body

Response Body

voice_id

string

created_at

string

RFC 3339 timestamp.

Custom voices - Delete

/v1/custom-voices/{voice_id}

Delete a custom voice.

Path parameters

voice_id

string

required

Response Body

deleted

boolean

Always true on success.

Custom voices - Get audio

/v1/custom-voices/{voice_id}/audio

Download the reference audio for a custom voice.

Path parameters

voice_id

string

required

Did you find this page helpful?

Last updated: April 26, 2026