/v1/stt

Transcribe an audio file to text.

Request Body

Response Body

text

string

Full transcript text. For multichannel requests, this is a merged transcript across all channels (words interleaved by timestamp).

language

string

Detected language code (ISO 639-1, e.g. en). Currently empty — language detection is not yet enabled.

duration

number

Audio duration in seconds (rounded to 2 decimal places).

words

array

Word-level segments with timestamps. Omitted when empty.

channels

array

Per-channel transcripts. Only present when multichannel=true. Omitted for single-channel audio.
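As a rough illustration of what "words interleaved by timestamp" means for the merged text field, the sketch below orders words from several channels by their start time. The shape of each channels entry is not documented here, so the assumption that every channel object carries a words list shaped like the top-level words array is hypothetical, as is the helper name merge_channels.

```python
def merge_channels(channels: list[dict]) -> str:
    """Interleave per-channel words by start time and join them into one string.

    Assumption (not confirmed by the reference): each channel is a dict with a
    "words" list whose items have "text" and "start" keys.
    """
    tagged = [
        (word["start"], index, word["text"])
        for index, channel in enumerate(channels)
        for word in channel.get("words", [])
    ]
    # Sort by start time; ties fall back to channel order for a stable result.
    tagged.sort(key=lambda item: (item[0], item[1]))
    return " ".join(text for _, _, text in tagged)


channels = [
    {"words": [{"text": "Hello", "start": 0.0}, {"text": "there", "start": 0.9}]},
    {"words": [{"text": "Hi", "start": 0.4}]},
]
print(merge_channels(channels))  # Hello Hi there
```

The server's actual merge rules (tie-breaking, punctuation, spacing) may differ; this only shows the ordering idea.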

POST /v1/stt

No parameters.

Response: 200 (JSON)

{
  "text": "The balance is $167,983.15. That is $23.4 kilograms.",
  "language": "",
  "duration": 8.4,
  "words": [
    {
      "text": "The",
      "start": 0,
      "end": 0.24,
      "confidence": 0.33
    },
    {
      "text": "balance",
      "start": 0.24,
      "end": 0.64,
      "confidence": 0.67
    },
    {
      "text": "is",
      "start": 0.64,
      "end": 0.88,
      "confidence": 0.41
    },
    {
      "text": "$167,983.15.",
      "start": 0.88,
      "end": 4.8,
      "confidence": 0.07
    },
    {
      "text": "That",
      "start": 6.16,
      "end": 6.48,
      "confidence": 0.29
    },
    {
      "text": "is",
      "start": 6.48,
      "end": 6.64,
      "confidence": 0.4
    },
    {
      "text": "$23.4",
      "start": 6.64,
      "end": 7.52,
      "confidence": 0.07
    },
    {
      "text": "kilograms.",
      "start": 7.76,
      "end": 8.4,
      "confidence": 0.09
    }
  ]
}
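A response like the one above can be post-processed on the client. The following is a minimal sketch that parses the JSON body and lists words whose confidence falls below a threshold; the helper name low_confidence_words and the 0.1 threshold are illustrative choices, not part of the API. The sample body is trimmed to three of the words from the example.

```python
import json


def low_confidence_words(response_json: str, threshold: float = 0.1) -> list[str]:
    """Return the text of every word whose confidence is below the threshold."""
    body = json.loads(response_json)
    # "words" is omitted when empty, so fall back to an empty list.
    return [w["text"] for w in body.get("words", []) if w["confidence"] < threshold]


sample = """
{
  "text": "The balance is $167,983.15. That is $23.4 kilograms.",
  "language": "",
  "duration": 8.4,
  "words": [
    {"text": "is", "start": 0.64, "end": 0.88, "confidence": 0.41},
    {"text": "$167,983.15.", "start": 0.88, "end": 4.8, "confidence": 0.07},
    {"text": "kilograms.", "start": 7.76, "end": 8.4, "confidence": 0.09}
  ]
}
"""
print(low_confidence_words(sample))  # ['$167,983.15.', 'kilograms.']
```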

Speech to text - Streaming

WSS

Real-time streaming speech-to-text via WebSocket. Stream raw audio as binary frames and receive JSON transcript events as the audio is processed. Configuration is done via query parameters at connection time. Each connection handles a single utterance — reconnect to transcribe another.

Handshake

URL

wss://api.x.ai/v1/stt

Method

GET

Status

101 Switching Protocols

Headers

Authorization

string

required

Bearer token authentication. Format: Bearer <your xAI API key>.

Bearer $XAI_API_KEY

Query Parameters

sample_rate

integer

optional

default: 16000

Audio sample rate in Hz. Supported values: 8000, 16000, 22050, 24000, 44100, 48000.

encoding

string

optional

default: pcm

Audio encoding format. pcm — signed 16-bit little-endian (2 bytes/sample). mulaw — G.711 µ-law (1 byte/sample). alaw — G.711 A-law (1 byte/sample).

interim_results

boolean

optional

default: false

When true, the server emits partial transcript events (is_final=false) approximately every 500 ms while audio is being processed. When false (default), only finalized results are sent.

endpointing

integer

optional

default: 10

Silence duration in milliseconds before the server fires a speech_final=true event, indicating that the speaker has stopped talking. Range: 0–5000. Set to 0 to fire on any VAD silence boundary with no added delay.

language

string

optional

default:

Language code (e.g. en, fr, de, ja). When set, enables Inverse Text Normalization — spoken-form numbers, currencies, and units are converted to their written form.

multichannel

boolean

optional

default: false

When true, enables per-channel transcription for interleaved multichannel audio. Requires channels to be set to ≥ 2.

channels

integer

optional

default: 1

Number of interleaved audio channels. Required when multichannel=true. Min: 2, Max: 8.

diarize

boolean

optional

default: false

When true, enables speaker diarization. Words in transcript.partial and transcript.done events include a speaker field (integer) identifying the detected speaker.

keyterm

string (repeatable)

optional

A key term to bias transcription toward (e.g. product names, proper nouns). Repeat the parameter for each term (e.g. keyterm=Understand+The+Universe). Max 100 terms, each up to 50 characters.

filler_words

boolean

optional

default: false

When true, filler words (e.g. uh, um, er) are included in the transcript. When false (default), filler words are automatically removed from the transcript text and the words array.
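Since all configuration happens through query parameters at connection time, it can help to build the URL in one place. The sketch below assembles a connection URL from a subset of the parameters above and checks the documented limits on the client side. The function name build_stt_url is hypothetical, and serializing booleans as the strings true and false is an assumption based on how this page writes multichannel=true.

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.x.ai/v1/stt"
SUPPORTED_RATES = {8000, 16000, 22050, 24000, 44100, 48000}
ENCODINGS = {"pcm", "mulaw", "alaw"}


def build_stt_url(sample_rate=16000, encoding="pcm", interim_results=False,
                  endpointing=10, language=None, keyterms=()):
    """Build the streaming URL, validating values against the documented limits."""
    if sample_rate not in SUPPORTED_RATES:
        raise ValueError(f"unsupported sample_rate: {sample_rate}")
    if encoding not in ENCODINGS:
        raise ValueError(f"unsupported encoding: {encoding}")
    if not 0 <= endpointing <= 5000:
        raise ValueError("endpointing must be between 0 and 5000 ms")
    if len(keyterms) > 100 or any(len(term) > 50 for term in keyterms):
        raise ValueError("at most 100 keyterms, each up to 50 characters")

    params = [
        ("sample_rate", sample_rate),
        ("encoding", encoding),
        # Assumption: booleans are sent as lowercase "true" / "false".
        ("interim_results", "true" if interim_results else "false"),
        ("endpointing", endpointing),
    ]
    if language:
        params.append(("language", language))
    # keyterm is repeatable: one query parameter per term.
    params.extend(("keyterm", term) for term in keyterms)
    return f"{BASE_URL}?{urlencode(params)}"


print(build_stt_url(language="en", keyterms=["Understand The Universe"]))
```

urlencode encodes spaces as +, which matches the keyterm=Understand+The+Universe form shown in the keyterm description.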

Example Message Flow

↓transcript.created

↑Binary frame (audio)

↓transcript.partial

↑Binary frame (audio)

↓transcript.partial

↑Binary frame (audio)

↓transcript.partial

↑audio.done

↓transcript.done

Client → Server

↑ Binary frame (audio)

Send raw audio as binary WebSocket frames in the encoding specified by the encoding query parameter. Audio should be streamed in real-time-paced chunks (e.g. 100 ms at a time). No base64 encoding — send raw bytes directly.
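To make the chunking guidance concrete, the sketch below works out how many bytes a 100 ms frame holds for a given sample rate and encoding, then slices a raw buffer into frames of that size. The bytes-per-sample values come from the encoding parameter description; multiplying by the channel count for interleaved audio is an assumption, and both helper names are hypothetical.

```python
from collections.abc import Iterator

# Bytes per sample, as listed in the encoding parameter description.
BYTES_PER_SAMPLE = {"pcm": 2, "mulaw": 1, "alaw": 1}


def chunk_size_bytes(sample_rate: int, encoding: str = "pcm",
                     channels: int = 1, chunk_ms: int = 100) -> int:
    """Size in bytes of one chunk covering chunk_ms of audio."""
    samples = sample_rate * chunk_ms // 1000
    # Assumption: interleaved audio carries one sample per channel per frame.
    return samples * BYTES_PER_SAMPLE[encoding] * channels


def iter_chunks(audio: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive slices of the buffer; the last one may be shorter."""
    for offset in range(0, len(audio), size):
        yield audio[offset:offset + size]


size = chunk_size_bytes(16000, "pcm")
print(size)  # 3200
frames = list(iter_chunks(b"\x00" * 7000, size))
print([len(frame) for frame in frames])  # [3200, 3200, 600]
```

A real-time sender would transmit each chunk as one binary WebSocket frame and wait roughly chunk_ms between sends to keep the real-time pacing described above.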

↑ audio.done

Signal that all audio has been sent. The server flushes any remaining buffered audio, emits final transcript events, and sends a transcript.done event. The connection closes after transcript.done.

Server → Client

↓ transcript.created

Sent immediately after the WebSocket connection is established and the server is ready to receive audio. Wait for this event before sending audio, because the server needs to initialize its ASR backend.

↓ transcript.partial

A transcript result for a portion of the audio stream. Two boolean fields, is_final and speech_final, convey one of three states: interim (is_final=false), where the text may still change; chunk final (is_final=true, speech_final=false), where the chunk's text is locked; and utterance final (is_final=true, speech_final=true), where the speaker has stopped talking.
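The three states can be derived from the two flags with a small helper. This is a minimal sketch: the function name classify_partial and the returned labels are illustrative, and it assumes the event has already been decoded from JSON into a dict containing is_final and speech_final.

```python
def classify_partial(event: dict) -> str:
    """Map the is_final / speech_final flags to one of the three states."""
    is_final = bool(event.get("is_final"))
    speech_final = bool(event.get("speech_final"))
    if not is_final:
        return "interim"  # text may still change
    if speech_final:
        return "utterance_final"  # speaker stopped talking
    return "chunk_final"  # chunk text is locked


print(classify_partial({"is_final": False}))  # interim
print(classify_partial({"is_final": True, "speech_final": False}))  # chunk_final
print(classify_partial({"is_final": True, "speech_final": True}))  # utterance_final
```

A client that only needs stable text can ignore interim events and append the text of each chunk final and utterance final event as it arrives.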

↓ transcript.done

The final transcript, sent after the client sends audio.done. The duration field is always present. When multichannel=true, one transcript.done event is sent per channel. The connection closes after this event.

↓ error

An error occurred during the session. Most errors (pipeline failures, stream timeouts) close the connection. Only client message parse errors keep the connection open.
