Speech to text - REST

/v1/stt

Transcribe an audio file to text.

Request Body

Response Body

text

string

Full transcript text. For multichannel requests, this is a merged transcript across all channels (words interleaved by timestamp).

language

string

Detected language code (ISO 639-1, e.g. en). Currently empty — language detection is not yet enabled.

duration

number

Audio duration in seconds (rounded to 2 decimal places).

words

array

Word-level segments with timestamps. Omitted when empty.

channels

array

Per-channel transcripts. Only present when multichannel=true. Omitted for single-channel audio.
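
Since `words` and `channels` can be omitted from the response, a client should treat them as optional. The following is a minimal sketch of one way to parse the response body; the `SttResponse` class and `parse_stt_response` function are hypothetical names, and the structure of the individual `words` and `channels` entries is not specified above, so they are kept as plain dictionaries.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SttResponse:
    """Hypothetical container for the response fields listed above."""
    text: str
    language: str
    duration: float
    # `words` and `channels` may be omitted by the server, so default to empty lists.
    words: list[dict[str, Any]] = field(default_factory=list)
    channels: list[dict[str, Any]] = field(default_factory=list)


def parse_stt_response(body: str) -> SttResponse:
    """Parse a JSON response body, tolerating omitted optional fields."""
    data = json.loads(body)
    return SttResponse(
        text=data.get("text", ""),
        language=data.get("language", ""),
        duration=float(data.get("duration", 0.0)),
        words=data.get("words") or [],
        channels=data.get("channels") or [],
    )
```

For example, `parse_stt_response('{"text": "hello world", "language": "", "duration": 1.25}')` yields an object whose `words` and `channels` are empty lists. The sample payload here is illustrative only, not captured from the API.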


Speech to text - Streaming

WSS wss://api.x.ai/v1/stt

Real-time streaming speech-to-text via WebSocket. The client streams raw audio as binary WebSocket frames, and the server returns JSON transcript events as the audio is processed.
Configuration is done via query parameters on the WebSocket upgrade URL. Because audio is sent as raw binary frames, no base64 encoding is needed.
The server uses VAD (Voice Activity Detection) to detect when the speaker stops talking and emits utterance-final events. For long continuous speech, the server automatically splits the speech into ~3-second chunks.
After the client sends audio.done, the server returns a transcript.done event containing any remaining transcript not already covered by speech_final events, plus the total audio duration. If all audio was covered by speech_final events, text and words will be empty. The client can then start a new turn without reconnecting.
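
Because transcript.done carries only the text not already delivered in speech_final events, a client that wants the full transcript of a turn has to combine both sources. A minimal sketch of that bookkeeping follows; `assemble_turn` is a hypothetical helper, it takes plain strings because the event payload layout is not described in this section, and joining with a single space is an assumption.

```python
def assemble_turn(speech_final_texts: list[str], done_text: str) -> str:
    """Combine utterance-final texts with the remainder from transcript.done.

    speech_final_texts: texts collected from speech_final events, in arrival order.
    done_text: the text of the transcript.done event (empty if nothing remained).
    """
    # Drop empty or whitespace-only pieces so the join does not add stray spaces.
    parts = [t.strip() for t in speech_final_texts if t.strip()]
    if done_text.strip():
        parts.append(done_text.strip())
    # Assumption: pieces are separated by a single space.
    return " ".join(parts)
```

Keeping the accumulated list per turn, and clearing it after transcript.done arrives, matches the note above that a new turn can start on the same connection.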

Handshake

URL

wss://api.x.ai/v1/stt

Method

GET

Status

101 Switching Protocols

Headers

Authorization

string

required

Bearer token authentication. Format: Bearer <your xAI API key>.

Bearer $XAI_API_KEY
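
As a small sketch, the header can be built from an environment variable. The `auth_headers` helper is a hypothetical name, and reading the key from `XAI_API_KEY` is an assumption based on the example value above.

```python
import os
from typing import Optional


def auth_headers(api_key: Optional[str] = None) -> dict[str, str]:
    """Return the Authorization header for the WebSocket handshake.

    Falls back to the XAI_API_KEY environment variable (an assumed convention)
    when no key is passed explicitly.
    """
    key = api_key or os.environ.get("XAI_API_KEY")
    if not key:
        raise RuntimeError("No API key provided and XAI_API_KEY is not set")
    return {"Authorization": f"Bearer {key}"}
```

The resulting dictionary can be passed to whichever WebSocket client library is in use as additional handshake headers.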

Query Parameters

sample_rate

integer

optional

Audio sample rate in Hz. Supported values: `8000`, `16000`, `22050`, `24000`, `44100`, `48000`.

encoding

string

optional

Audio encoding format. `pcm` — signed 16-bit little-endian (2 bytes/sample). `mulaw` — G.711 µ-law (1 byte/sample). `alaw` — G.711 A-law (1 byte/sample).
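
The bytes-per-sample figures above determine how many bytes correspond to a given duration of audio, which is useful when slicing a buffer into binary frames. The sketch below is one way to compute this; the helper names and the 100 ms frame duration are assumptions, since no frame size is prescribed here.

```python
from typing import Iterator

# Bytes per sample for each encoding, as listed in the parameter description.
BYTES_PER_SAMPLE = {"pcm": 2, "mulaw": 1, "alaw": 1}


def frame_size_bytes(sample_rate: int, encoding: str,
                     frame_ms: int = 100, channels: int = 1) -> int:
    """Number of bytes covering `frame_ms` milliseconds of interleaved audio."""
    samples_per_channel = sample_rate * frame_ms // 1000
    return samples_per_channel * BYTES_PER_SAMPLE[encoding] * channels


def split_frames(audio: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive slices of `audio`; the last one may be shorter."""
    for start in range(0, len(audio), size):
        yield audio[start:start + size]
```

For instance, 100 ms of 16 kHz mono `pcm` is 1600 samples at 2 bytes each, so `frame_size_bytes(16000, "pcm")` returns 3200.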

interim_results

boolean

optional

When `true`, the server emits partial transcript events (`is_final=false`) approximately every 500 ms while audio is being processed. When `false` (default), only finalized results are sent.

endpointing

integer

optional

Silence duration in milliseconds before the server fires a `speech_final=true` event, indicating the speaker stopped talking. Range: 0–5000. Set to `0` for no delay (fire on any VAD silence boundary). Default: 10 ms.

language

string

optional

Language code (e.g. `en`, `fr`, `de`, `ja`). When set, enables Inverse Text Normalization — spoken-form numbers, currencies, and units are converted to their written form.

multichannel

boolean

optional

When `true`, enables per-channel transcription for interleaved multichannel audio. Requires `channels` to be set to ≥ 2.

channels

integer

optional

Number of interleaved audio channels. Required when `multichannel=true`. Min: 2, Max: 8.

diarize

boolean

optional

When `true`, enables speaker diarization. Words in `transcript.partial` and `transcript.done` events include a `speaker` field (integer) identifying the detected speaker.
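
The constraints on the query parameters above can be checked on the client before opening the connection. The following is a minimal sketch of a URL builder; `build_stt_url` is a hypothetical helper, and serializing booleans as lowercase `true`/`false` is an assumption based on how the values are written in this reference.

```python
from urllib.parse import urlencode

STT_WS_URL = "wss://api.x.ai/v1/stt"
SUPPORTED_SAMPLE_RATES = {8000, 16000, 22050, 24000, 44100, 48000}
SUPPORTED_ENCODINGS = {"pcm", "mulaw", "alaw"}


def _encode(value) -> str:
    # bool must be checked explicitly: str(True) would give "True", not "true".
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def build_stt_url(**params) -> str:
    """Validate the documented constraints and return the upgrade URL."""
    params = {k: v for k, v in params.items() if v is not None}
    if "sample_rate" in params and params["sample_rate"] not in SUPPORTED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {params['sample_rate']}")
    if "encoding" in params and params["encoding"] not in SUPPORTED_ENCODINGS:
        raise ValueError(f"unsupported encoding: {params['encoding']}")
    if "endpointing" in params and not 0 <= params["endpointing"] <= 5000:
        raise ValueError("endpointing must be in the range 0-5000 ms")
    if params.get("multichannel") and "channels" not in params:
        raise ValueError("channels is required when multichannel=true")
    if "channels" in params and not 2 <= params["channels"] <= 8:
        raise ValueError("channels must be between 2 and 8")
    query = urlencode({k: _encode(v) for k, v in params.items()})
    return f"{STT_WS_URL}?{query}" if query else STT_WS_URL
```

For example, `build_stt_url(sample_rate=16000, encoding="pcm", interim_results=True)` returns `wss://api.x.ai/v1/stt?sample_rate=16000&encoding=pcm&interim_results=true`. Validating locally gives a clearer error than a failed handshake, though the server remains the authority on what it accepts.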

Server → Client