Inference API

Voice


Speech to text - REST

/v1/stt

Transcribe an audio file to text.

Request Body

Response Body

text

string

Full transcript text. For multichannel requests, this is a merged transcript across all channels (words interleaved by timestamp).

language

string

Detected language code (ISO 639-1, e.g. en). Currently empty — language detection is not yet enabled.

duration

number

Audio duration in seconds (rounded to 2 decimal places).

words

array

Word-level segments with timestamps. Omitted when empty.

channels

array

Per-channel transcripts. Only present when multichannel=true. Omitted for single-channel audio.
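The merged transcript described above (words interleaved by timestamp across channels) can be sketched client-side. This is an illustration only: the word-object shape ({"word", "start", "end"}) is an assumption, not a documented schema.

```python
# Sketch: merge per-channel word lists into one transcript ordered by start
# time, mirroring the documented "words interleaved by timestamp" behavior.
# The word-object shape ({"word", "start", "end"}) is an assumption.

def merge_channels(channels):
    """channels: list of per-channel word lists; returns the merged text."""
    merged = [w for ch in channels for w in ch]
    merged.sort(key=lambda w: w["start"])
    return " ".join(w["word"] for w in merged)

ch0 = [{"word": "hello", "start": 0.10, "end": 0.40},
       {"word": "there", "start": 0.45, "end": 0.80}]
ch1 = [{"word": "hi", "start": 0.20, "end": 0.35}]

print(merge_channels([ch0, ch1]))  # hello hi there
```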


Speech to text - Streaming

WSS

Real-time streaming speech-to-text via WebSocket. Stream raw audio as binary frames and receive JSON transcript events as the audio is processed. Configuration is done via query parameters at connection time. Each connection handles a single utterance — reconnect to transcribe another.

Handshake

URL

wss://api.x.ai/v1/stt

Method

GET

Status

101 Switching Protocols

Headers

Authorization

string

required

Bearer token authentication. Format: Bearer <your xAI API key>.

Bearer $XAI_API_KEY

Query Parameters

sample_rate

integer

optional

default: 16000

Audio sample rate in Hz. Supported values: 8000, 16000, 22050, 24000, 44100, 48000.

encoding

string

optional

default: pcm

Audio encoding format. pcm — signed 16-bit little-endian (2 bytes/sample). mulaw — G.711 µ-law (1 byte/sample). alaw — G.711 A-law (1 byte/sample).
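As a sanity check on the default pcm encoding (signed 16-bit little-endian, 2 bytes per sample), here is a minimal sketch that packs float samples into that wire format before sending them as binary frames:

```python
import struct

def floats_to_pcm16le(samples):
    """Pack floats in [-1.0, 1.0] as signed 16-bit little-endian PCM —
    the `pcm` encoding (2 bytes per sample)."""
    ints = [max(-32768, min(32767, int(s * 32767))) for s in samples]
    return struct.pack("<%dh" % len(ints), *ints)

frame = floats_to_pcm16le([0.0, 0.5, -0.5, 1.0])
print(len(frame))  # 4 samples * 2 bytes = 8 bytes
```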

interim_results

boolean

optional

default: false

When true, the server emits partial transcript events (is_final=false) approximately every 500 ms while audio is being processed. When false (default), only finalized results are sent.

endpointing

integer

optional

default: 10

Silence duration in milliseconds before the server fires a speech_final=true event, indicating the speaker stopped talking. Range: 0–5000. Set to 0 for no delay (fire on any voice-activity-detection silence boundary).
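The interim_results and endpointing flags together shape the event stream a client sees. A minimal handler sketch follows; only the is_final and speech_final flags are documented here, so the rest of the event shape (a text field alongside them) is an assumption.

```python
# Sketch of client-side event handling. Partials (is_final=false) overwrite
# each other; finals are accumulated; speech_final=true ends the utterance.
import json

def handle_event(raw, state):
    """Fold one JSON transcript event into state; return True once
    speech_final=true signals the end of the utterance."""
    event = json.loads(raw)
    if event.get("is_final"):
        state["finalized"].append(event.get("text", ""))
    else:
        state["interim"] = event.get("text", "")
    return bool(event.get("speech_final"))

state = {"finalized": [], "interim": ""}
handle_event('{"text": "hello wor", "is_final": false}', state)
done = handle_event(
    '{"text": "hello world", "is_final": true, "speech_final": true}', state)
print(state["finalized"], done)  # ['hello world'] True
```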

language

string

optional

default: (none)

Language code (e.g. en, fr, de, ja). When set, enables Inverse Text Normalization — spoken-form numbers, currencies, and units are converted to their written form.

multichannel

boolean

optional

default: false

When true, enables per-channel transcription for interleaved multichannel audio. Requires channels to be set to ≥ 2.

channels

integer

optional

default: 1

Number of interleaved audio channels. Required when multichannel=true (min: 2, max: 8); otherwise defaults to 1.

diarize

boolean

optional

default: false

When true, enables speaker diarization. Words in transcript.partial and transcript.done events include a speaker field (integer) identifying the detected speaker.
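Putting the query parameters together, the connection URL can be assembled as below. Parameter names come from the table above; serializing booleans as lowercase true/false is an assumption, and any WebSocket client can then open the socket with the Authorization: Bearer header and stream binary audio frames.

```python
from urllib.parse import urlencode

BASE = "wss://api.x.ai/v1/stt"

def stt_url(**params):
    """Build the streaming STT URL from the query parameters above.
    Booleans are serialized as lowercase true/false (an assumption)."""
    norm = {k: str(v).lower() if isinstance(v, bool) else v
            for k, v in params.items()}
    return BASE + "?" + urlencode(norm)

url = stt_url(sample_rate=16000, encoding="pcm",
              interim_results=True, multichannel=True, channels=2)
print(url)
```

After connecting, send raw audio as binary frames and read JSON transcript events until one arrives with speech_final=true, then reconnect for the next utterance.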