Speech to text - REST
/v1/stt
Transcribe an audio file to text.
Request Body
Response Body
text
string
Full transcript text. For multichannel requests, this is a merged transcript across all channels (words interleaved by timestamp).
language
string
Detected language code (ISO 639-1, e.g. en). Currently empty — language detection is not yet enabled.
duration
number
Audio duration in seconds (rounded to 2 decimal places).
words
array
Word-level segments with timestamps. Omitted when empty.
channels
array
Per-channel transcripts. Only present when multichannel=true. Omitted for single-channel audio.
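A minimal sketch of calling the REST endpoint from Python, using only the standard library. The request body format is not specified in this section, so the sketch assumes the audio bytes are posted directly with a `Content-Type` header; the `https://api.x.ai` base URL is likewise an assumption inferred from the streaming host. Treat both as placeholders, not the confirmed wire format.

```python
# Hedged sketch: POST an audio file to /v1/stt and read the documented
# response fields (text, duration, words, channels). The raw-bytes body
# and the https base URL are assumptions, not confirmed by this reference.
import json
import urllib.request

API_URL = "https://api.x.ai/v1/stt"  # assumed https counterpart of the wss host


def build_request(api_key: str, audio_bytes: bytes,
                  content_type: str = "audio/wav") -> urllib.request.Request:
    """Build a POST request for a single-channel transcription call."""
    return urllib.request.Request(
        API_URL,
        data=audio_bytes,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": content_type,
        },
        method="POST",
    )


# Sending (requires a valid key and network access):
# with urllib.request.urlopen(build_request(key, open("clip.wav", "rb").read())) as r:
#     body = json.load(r)
#     print(body["text"], body["duration"])
```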
Speech to text - Streaming
Real-time streaming speech-to-text via WebSocket. Stream raw audio as binary frames and receive JSON transcript events as the audio is processed. Configuration is done via query parameters at connection time. Each connection handles a single utterance — reconnect to transcribe another.
Handshake
URL
wss://api.x.ai/v1/stt
Method
GET
Status
101 Switching Protocols
Headers
Authorization
string
required
Bearer token authentication. Format: Bearer <your xAI API key>.
Bearer $XAI_API_KEY
Query Parameters
sample_rate
integer
optional
default: 16000
Audio sample rate in Hz. Supported values: 8000, 16000, 22050, 24000, 44100, 48000.
encoding
string
optional
default: pcm
Audio encoding format. pcm — signed 16-bit little-endian (2 bytes/sample). mulaw — G.711 µ-law (1 byte/sample). alaw — G.711 A-law (1 byte/sample).
interim_results
boolean
optional
default: false
When true, the server emits partial transcript events (is_final=false) approximately every 500 ms while audio is being processed. When false (default), only finalized results are sent.
endpointing
integer
optional
default: 10
Silence duration in milliseconds before the server fires a speech_final=true event, indicating the speaker stopped talking. Range: 0–5000. Set to 0 for no delay (fire on any VAD silence boundary).
language
string
optional
default:
Language code (e.g. en, fr, de, ja). When set, enables Inverse Text Normalization — spoken-form numbers, currencies, and units are converted to their written form.
multichannel
boolean
optional
default: false
When true, enables per-channel transcription for interleaved multichannel audio. Requires channels to be set to ≥ 2.
channels
integer
optional
default: 1
Number of interleaved audio channels. Required when multichannel=true. Min: 2, Max: 8.
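Multichannel audio must be interleaved: sample 0 of every channel, then sample 1, and so on. A small sketch interleaving two equal-length mono 16-bit PCM streams into the stereo layout expected when `multichannel=true` and `channels=2`:

```python
# Interleave two mono 16-bit PCM byte streams frame by frame
# (L0 R0 L1 R1 ...), the layout expected for channels=2.
def interleave_pcm16(left: bytes, right: bytes) -> bytes:
    """Interleave two equal-length 16-bit PCM channels sample by sample."""
    assert len(left) == len(right) and len(left) % 2 == 0
    out = bytearray()
    for i in range(0, len(left), 2):
        out += left[i:i + 2] + right[i:i + 2]
    return bytes(out)
```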
diarize
boolean
optional
default: false
When true, enables speaker diarization. Words in transcript.partial and transcript.done events include a speaker field (integer) identifying the detected speaker.
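Since configuration is passed entirely through query parameters at connection time, a client only needs to build the URL once per utterance. A sketch that assembles the connection URL from the parameters above; the actual connect (shown commented out with the third-party `websockets` package, whose header-passing keyword varies by version) is an assumption about client tooling, not part of the API itself:

```python
# Build the streaming STT WebSocket URL from the documented query parameters.
from urllib.parse import urlencode

BASE = "wss://api.x.ai/v1/stt"


def stt_url(**params) -> str:
    """Return the connection URL with defaults plus any overrides."""
    query = {"sample_rate": 16000, "encoding": "pcm"}  # documented defaults
    query.update(params)
    return f"{BASE}?{urlencode(query)}"


url = stt_url(interim_results="true", endpointing=300, language="en")

# Connecting and streaming (sketch; requires the `websockets` package and a key):
# import asyncio, json, websockets
# async with websockets.connect(url,
#         additional_headers={"Authorization": f"Bearer {key}"}) as ws:
#     await ws.send(pcm_chunk)               # raw audio as a binary frame
#     event = json.loads(await ws.recv())    # JSON transcript event
```

Each connection handles a single utterance, so a new URL/connect pair is needed per utterance.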