#### Inference API

# Voice

***

## POST /v1/stt

Transcribe an audio file to text.

### Request Body

* `file` (string) — Audio file to transcribe. Maximum size: 500 MB. Supported container formats (auto-detected): \`wav\`, \`mp3\`, \`ogg\`, \`opus\`, \`flac\`, \`aac\`, \`mp4\`, \`m4a\`, \`mkv\` (MP3/AAC/FLAC codecs only). Supported raw formats (requires \`audio\_format\` and \`sample\_rate\`): \`pcm\`, \`mulaw\`, \`alaw\`. Must be the last field in the multipart form.

* `url` (string) — URL of an audio file to download and transcribe (server-side). Either \`file\` or \`url\` must be provided.

* `audio_format` ("pcm" | "mulaw" | "alaw" | "wav" | "mp3" | "ogg" | "opus" | "flac" | "aac" | "mp4" | "m4a" | "mkv") — Audio format hint. \*\*Only required for raw/headerless formats\*\* (\`pcm\`, \`mulaw\`, \`alaw\`). For container formats (MP3, WAV, OGG, etc.) the server auto-detects the format from the file header — do not set this field.

* `sample_rate` ("8000" | "16000" | "22050" | "24000" | "44100" | "48000") — Audio sample rate in Hz. \*\*Required when \`audio\_format\` is a raw format\*\* (\`pcm\`, \`mulaw\`, \`alaw\`). Ignored for container formats. Either \`sample\_rate\` or \`sample\_rate\_hertz\` may be used.

* `language` (string) — Language code for the audio (e.g. \`en\`, \`fr\`, \`de\`, \`ja\`). When set together with \`format=true\`, enables Inverse Text Normalization — spoken-form numbers, currencies, and units are converted to their written form.

* `format` ("true" | "false") — When \`true\`, enables text formatting. Requires \`language\` to be set.

* `multichannel` ("true" | "false") — When \`true\`, enables per-channel transcription. Each audio channel is transcribed independently and results are returned in the \`channels\` array.

* `channels` (integer) — Number of audio channels. Required for multichannel raw audio (min 2, max 8). For container formats, the channel count is auto-detected from the file header.

* `diarize` ("true" | "false") — When \`true\`, enables speaker diarization. Each word in the response includes a \`speaker\` field (integer) identifying the detected speaker.

### Response Body

* `text` (string, required) — Full transcript text. For multichannel requests, this is a merged transcript across all channels (words interleaved by timestamp).

* `language` (string, required) — Detected language code (ISO 639-1, e.g. \`en\`). Currently empty — language detection is not yet enabled.

* `duration` (number, required) — Audio duration in seconds (rounded to 2 decimal places).

* `words` (array\<object>) — Word-level segments with timestamps. Omitted when empty.

  * `text` (string, required) — The word text.

  * `start` (number, required) — Word start time in seconds (2 d.p.).

  * `end` (number, required) — Word end time in seconds (2 d.p.).

  * `confidence` (number) — Confidence score (0.0–1.0, entropy-based). Omitted when 0.

  * `speaker` (integer) — Speaker index (0-based). Only present when \`diarize=true\`.

* `channels` (array\<object>) — Per-channel transcripts. Only present when \`multichannel=true\`. Omitted for single-channel audio.

  * `index` (integer, required) — Zero-based channel index in the source audio.

  * `language` (string) — Detected language code for this channel. Currently empty.

  * `text` (string, required) — Full transcript text for this channel.

  * `words` (array\<object>) — Word-level segments with timestamps for this channel.

    * `text` (string, required) — The word text.

    * `start` (number, required) — Word start time in seconds (2 d.p.).

    * `end` (number, required) — Word end time in seconds (2 d.p.).

    * `confidence` (number) — Confidence score (0.0–1.0, entropy-based). Omitted when 0.

    * `speaker` (integer) — Speaker index (0-based). Only present when \`diarize=true\`.

\*\*Response example:\*\*

```json
{
  "text": "The balance is $167,983.15. That is $23.4 kilograms.",
  "language": "",
  "duration": 8.4,
  "words": [
    {
      "text": "The",
      "start": 0,
      "end": 0.24,
      "confidence": 0.33
    },
    {
      "text": "balance",
      "start": 0.24,
      "end": 0.64,
      "confidence": 0.67
    },
    {
      "text": "is",
      "start": 0.64,
      "end": 0.88,
      "confidence": 0.41
    },
    {
      "text": "$167,983.15.",
      "start": 0.88,
      "end": 4.8,
      "confidence": 0.07
    },
    {
      "text": "That",
      "start": 6.16,
      "end": 6.48,
      "confidence": 0.29
    },
    {
      "text": "is",
      "start": 6.48,
      "end": 6.64,
      "confidence": 0.4
    },
    {
      "text": "$23.4",
      "start": 6.64,
      "end": 7.52,
      "confidence": 0.07
    },
    {
      "text": "kilograms.",
      "start": 7.76,
      "end": 8.4,
      "confidence": 0.09
    }
  ]
}
```
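
For multichannel responses, the top-level `text` field is described above as the per-channel words interleaved by timestamp. A client can reproduce that merge from the `channels` array; this is an illustrative sketch with invented sample data, not output from the API.

```python
# Rebuild a merged transcript from per-channel word lists by interleaving
# words on their `start` timestamps, as the `text` field description states.
def merge_channels(channels: list[dict]) -> str:
    words = [w for ch in channels for w in ch.get("words", [])]
    words.sort(key=lambda w: w["start"])  # interleave by timestamp
    return " ".join(w["text"] for w in words)

resp_channels = [
    {"index": 0, "text": "Hello there.",
     "words": [{"text": "Hello", "start": 0.0, "end": 0.4},
               {"text": "there.", "start": 0.5, "end": 0.9}]},
    {"index": 1, "text": "Hi.",
     "words": [{"text": "Hi.", "start": 0.45, "end": 0.7}]},
]
merged = merge_channels(resp_channels)  # "Hello Hi. there."
```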

***

## Speech to text - Streaming

WebSocket endpoint: `wss://api.x.ai/v1/stt`

Real-time streaming speech-to-text via WebSocket. Stream raw audio as binary frames and receive JSON transcript events as the audio is processed. Configuration is done via query parameters at connection time. Each connection handles a single utterance — reconnect to transcribe another.

Full schemas and examples: [`/stt-streaming.ws.json`](/stt-streaming.ws.json)

### Query Parameters

* `sample_rate` (integer) — Audio sample rate in Hz. Supported values: \`8000\`, \`16000\`, \`22050\`, \`24000\`, \`44100\`, \`48000\`.

* `encoding` (string) — Audio encoding format. \`pcm\` — signed 16-bit little-endian (2 bytes/sample). \`mulaw\` — G.711 µ-law (1 byte/sample). \`alaw\` — G.711 A-law (1 byte/sample).

* `interim_results` (boolean) — When \`true\`, the server emits partial transcript events (\`is\_final=false\`) approximately every 500 ms while audio is being processed. When \`false\` (default), only finalized results are sent.

* `endpointing` (integer) — Silence duration in milliseconds before the server fires a \`speech\_final=true\` event, indicating the speaker stopped talking. Range: 0–5000. Set to \`0\` for no delay (fire on any VAD silence boundary). Default: 10ms.

* `language` (string) — Language code (e.g. \`en\`, \`fr\`, \`de\`, \`ja\`). When set, enables Inverse Text Normalization — spoken-form numbers, currencies, and units are converted to their written form.

* `multichannel` (boolean) — When \`true\`, enables per-channel transcription for interleaved multichannel audio. Requires \`channels\` to be set to ≥ 2.

* `channels` (integer) — Number of interleaved audio channels. Required when \`multichannel=true\`. Min: 2, Max: 8.

* `diarize` (boolean) — When \`true\`, enables speaker diarization. Words in \`transcript.partial\` and \`transcript.done\` events include a \`speaker\` field (integer) identifying the detected speaker.

### Client Messages

* `Binary frame (audio)` — Send raw audio as binary WebSocket frames in the encoding specified by the `encoding` query parameter. Audio should be streamed in real-time-paced chunks (e.g. 100 ms at a time). No base64 encoding — send raw bytes directly.

* `audio.done` — Signal that all audio has been sent. The server flushes any remaining buffered audio, emits final transcript events, and sends a `transcript.done` event. The connection closes after `transcript.done`.
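
For real-time pacing, the frame size follows from the encoding and sample rate: `pcm` is 2 bytes per sample, `mulaw` and `alaw` are 1 byte per sample. A short sketch of the arithmetic:

```python
# Bytes per real-time-paced audio chunk for each supported raw encoding.
BYTES_PER_SAMPLE = {"pcm": 2, "mulaw": 1, "alaw": 1}

def chunk_size(encoding: str, sample_rate: int, ms: int = 100, channels: int = 1) -> int:
    """Size in bytes of one `ms`-millisecond binary frame."""
    return BYTES_PER_SAMPLE[encoding] * sample_rate * channels * ms // 1000

chunk_size("pcm", 16000)   # 3200 bytes per 100 ms
chunk_size("mulaw", 8000)  # 800 bytes per 100 ms
```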

### Server Messages

* `transcript.created` — Sent immediately after the WebSocket connection is established and the server is ready to receive audio. **Wait for this event before sending audio** — the server needs to initialize its ASR backend.

* `transcript.partial` — A transcript result for a portion of the audio stream. Two boolean fields convey state: interim (`is_final=false`) means the text may still change; chunk final (`is_final=true`, `speech_final=false`) means the chunk's text is locked; utterance final (`is_final=true`, `speech_final=true`) means the speaker stopped talking.

* `transcript.done` — Sent after the client sends `audio.done`. Contains any remaining transcript not already delivered via a `speech_final=true` event, plus the total audio `duration`. The server closes the connection after this event.

* `error` — An error occurred during the session. Most errors (pipeline failures, stream timeouts) close the connection; only client message parse errors keep the connection open.
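
One way a client might branch on the three `transcript.partial` states described above. The `is_final`/`speech_final` field names come from this section; any other payload structure is an assumption.

```python
# Classify a transcript.partial event into the three states the docs describe.
def classify(event: dict) -> str:
    if not event.get("is_final"):
        return "interim"          # text may still change
    if event.get("speech_final"):
        return "utterance_final"  # speaker stopped talking
    return "chunk_final"          # chunk text is locked

classify({"is_final": False})                        # "interim"
classify({"is_final": True, "speech_final": False})  # "chunk_final"
classify({"is_final": True, "speech_final": True})   # "utterance_final"
```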

### Example Message Flow

1. `transcript.created` (server)

2. `Binary frame (audio)` (client)

3. `Binary frame (audio)` (client)

4. `transcript.partial` (server)

5. `Binary frame (audio)` (client)

6. `transcript.partial` (server)

7. `Binary frame (audio)` (client)

8. `transcript.partial` (server)

9. `audio.done` (client)

10. `transcript.done` (server)
