# Speech to Text

Transcribe audio files into text with a single API call, or stream audio in real time over WebSocket. The API supports 12 audio formats, word-level timestamps, multichannel transcription, and text formatting.

## Quick Start

Transcribe an audio file with a single API call:

```bash
curl -X POST https://api.x.ai/v1/stt \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -F format=true \
  -F language=en \
  -F file=@audio.mp3
```

```python customLanguage="pythonWithoutSDK"
import os
import requests

with open("audio.mp3", "rb") as f:
    response = requests.post(
        "https://api.x.ai/v1/stt",
        headers={"Authorization": f"Bearer {os.environ['XAI_API_KEY']}"},
        files={"file": ("audio.mp3", f, "audio/mpeg")},
        data={"format": "true", "language": "en"},
    )
response.raise_for_status()

result = response.json()
print(result["text"])
print(f"Duration: {result['duration']}s")
for word in result.get("words", []):
    print(f"  {word['start']:.2f}s - {word['end']:.2f}s: {word['text']}")
```

```javascript customLanguage="javascriptWithoutSDK"
import fs from "fs";

const formData = new FormData();
formData.append("file", new Blob([fs.readFileSync("audio.mp3")]), "audio.mp3");
formData.append("format", "true");
formData.append("language", "en");

const response = await fetch("https://api.x.ai/v1/stt", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.XAI_API_KEY}`,
  },
  body: formData,
});

if (!response.ok) throw new Error(`STT error ${response.status}`);

const result = await response.json();
console.log(result.text);
console.log(`Duration: ${result.duration}s`);
for (const word of result.words ?? []) {
  console.log(`  ${word.start.toFixed(2)}s - ${word.end.toFixed(2)}s: ${word.text}`);
}
```

Note: The `file` parameter must always be the last parameter in the curl command.

[Get API Key →](https://console.x.ai/team/default/api-keys?campaign=voice-docs-stt)

[Live Voice Demos](https://x.ai/api/voice)

## Supported Languages

The `language` parameter enables text formatting for the languages below. The model transcribes speech in any of these languages regardless of whether `language` is set; setting it, together with `format=true`, formats numbers, currencies, and units into their written form.

| Language | Code | | Language | Code |
|----------|------|-|----------|------|
| Arabic | `ar` | | Macedonian | `mk` |
| Czech | `cs` | | Malay | `ms` |
| Danish | `da` | | Persian | `fa` |
| Dutch | `nl` | | Polish | `pl` |
| English | `en` | | Portuguese | `pt` |
| Filipino | `fil` | | Romanian | `ro` |
| French | `fr` | | Russian | `ru` |
| German | `de` | | Spanish | `es` |
| Hindi | `hi` | | Swedish | `sv` |
| Indonesian | `id` | | Thai | `th` |
| Italian | `it` | | Turkish | `tr` |
| Japanese | `ja` | | Vietnamese | `vi` |
| Korean | `ko` | | | |

## Request Body

The request uses `multipart/form-data`. Either `file` or `url` must be provided.

| Parameter | Type | Default | Required | Description |
|-----------|------|---------|----------|-------------|
| `file` | file | | ✓† | Audio file to transcribe. Max **500 MB**. See [Supported Formats](#supported-audio-formats). |
| `url` | string | | ✓† | URL of an audio file to download and transcribe (server-side). |
| `audio_format` | string | | | Format hint for raw/headerless audio: `pcm`, `mulaw`, `alaw`. Container formats are auto-detected — do not set this field for MP3, WAV, etc. |
| `sample_rate` | integer | | | Sample rate in Hz. **Required for raw audio** (`pcm`, `mulaw`, `alaw`). Supported: `8000`, `16000`, `22050`, `24000`, `44100`, `48000`. |
| `language` | string | | | Language code (e.g. `en`, `fr`, `de`). Used with `format=true` to enable text formatting. See [Supported Languages](#supported-languages). |
| `format` | boolean | `false` | | When `true`, enables Inverse Text Normalization — converts spoken numbers/currency to written form (e.g. "one hundred dollars" → "$100"). Requires `language`. |
| `multichannel` | boolean | `false` | | When `true`, transcribes each audio channel independently. Results returned in the `channels` array. |
| `channels` | integer | | | Number of audio channels (2–8). Required for multichannel raw audio. Auto-detected for container formats. |
| `diarize` | boolean | `false` | | When `true`, enables speaker diarization. Each word in the response includes a `speaker` field (integer) identifying the detected speaker. |

† Either `file` or `url` must be provided.
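Transcribing from a hosted file uses `url` in place of `file`; the URL below is a placeholder, not a real recording:

```shell
curl -X POST https://api.x.ai/v1/stt \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -F format=true \
  -F language=en \
  -F url="https://example.com/recordings/audio.mp3"
```

The server downloads the file itself, so the 500 MB limit still applies to the fetched audio.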

### Example with text formatting

```bash
curl -X POST https://api.x.ai/v1/stt \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -F format=true \
  -F language=en \
  -F file=@meeting.mp3
```

Note: The `file` parameter must always be the last parameter in the curl command.

## Response

The response includes the full transcript, audio duration, and word-level timestamps.

```json
{
  "text": "The balance is $167,983.15.",
  "language": "English",
  "duration": 3.45,
  "words": [
    { "text": "The", "start": 0.24, "end": 0.48 },
    { "text": "balance", "start": 0.48, "end": 0.96 },
    { "text": "is", "start": 0.96, "end": 1.12 },
    { "text": "$167,983.15.", "start": 1.12, "end": 3.20 }
  ]
}
```

| Field | Type | Description |
|-------|------|-------------|
| `text` | string | Full transcript text. |
| `language` | string | Detected language name (e.g. `"English"`, `"French"`). |
| `duration` | number | Audio duration in seconds, rounded to two decimal places. |
| `words` | array | Word-level segments with `text`, `start`, `end`, and `speaker` (integer, only when `diarize=true`). |
| `channels` | array | Per-channel transcripts (only when `multichannel=true`). Each entry has `index`, `text`, and `words`. |

## Supported Audio Formats

### Container formats (auto-detected)

| Format | Extension | Description |
|--------|-----------|-------------|
| WAV | `.wav` | Waveform Audio — lossless, best quality input |
| MP3 | `.mp3` | MPEG Audio Layer 3 — widely supported |
| OGG | `.ogg` | Ogg container — open format |
| Opus | `.opus` | Opus codec — low-latency, high quality |
| FLAC | `.flac` | Free Lossless Audio Codec — lossless compression |
| AAC | `.aac` | Advanced Audio Coding |
| MP4 | `.mp4` | MPEG-4 container |
| M4A | `.m4a` | MPEG-4 Audio — Apple ecosystem standard |
| MKV | `.mkv` | Matroska container — supports MP3, AAC, and FLAC audio codecs |

### Raw formats (require `audio_format` and `sample_rate`)

| Format | `audio_format` value | Description |
|--------|---------------------|-------------|
| PCM | `pcm` | Signed 16-bit little-endian (2 bytes/sample) |
| µ-law | `mulaw` | G.711 µ-law (1 byte/sample) |
| A-law | `alaw` | G.711 A-law (1 byte/sample) |
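A raw upload then carries both fields alongside the data; `audio.raw` here stands in for a headerless 16 kHz PCM16 capture:

```shell
curl -X POST https://api.x.ai/v1/stt \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -F audio_format=pcm \
  -F sample_rate=16000 \
  -F file=@audio.raw
```

As in the other examples, `file` stays last in the command.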

### Limits

* **Max file size:** 500 MB
* **Channels:** Mono, stereo, or up to 8 channels (with `multichannel=true`)
* **Sample rates:** 8000, 16000, 22050, 24000, 44100, 48000 Hz

## Streaming Speech-to-Text (WebSocket)

For real-time transcription, use the WebSocket API at `wss://api.x.ai/v1/stt`. The client streams raw audio as binary WebSocket frames and receives JSON transcript events as the audio is processed.

**Endpoint:** `wss://api.x.ai/v1/stt`

Configuration is done via URL query parameters — no setup message required. Audio is sent as raw binary frames (no base64 encoding).

**Never expose your API key in client-side code.** Always proxy WebSocket connections through your backend.

### Query Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `sample_rate` | integer | `16000` | Audio sample rate in Hz. |
| `encoding` | string | `pcm` | Audio encoding: `pcm`, `mulaw`, or `alaw`. |
| `interim_results` | boolean | `false` | When `true`, emit partial transcripts (`is_final=false`) every ~500 ms. |
| `endpointing` | integer | `10` | Silence duration (ms) before an utterance-final event is emitted. Range: 0–5000. `0` = fire on any VAD silence boundary. |
| `language` | string | | Language code for text formatting. See [Supported Languages](#supported-languages). |
| `diarize` | boolean | | When `true`, enables speaker diarization. Words include a `speaker` field identifying the detected speaker. |
| `multichannel` | boolean | `false` | Per-channel transcription. Requires `channels` ≥ 2. |
| `channels` | integer | `1` | Number of interleaved audio channels (max 8). |

### Server Events

| Event | Description |
|-------|-------------|
| `transcript.created` | Server ready — wait for this before sending audio. |
| `transcript.partial` | Transcript result with `text`, `words`, `is_final`, `speech_final`, `start`, `duration`. Includes `channel_index` when `multichannel=true`. |
| `transcript.done` | End of turn after `audio.done`. Contains remaining transcript, or empty `text`/`words` if `speech_final` already covered all audio. `duration` always present. Includes `channel_index` when `multichannel=true` — one event is sent per channel. |
| `error` | Error with `message` field. Connection stays open. |

The `transcript.partial` event uses `is_final` and `speech_final` to convey three states:

| `is_final` | `speech_final` | Meaning |
|:---:|:---:|---|
| `false` | `false` | **Interim** — text may change (only when `interim_results=true`) |
| `true` | `false` | **Chunk final** — text locked, ~3s of speech finalized |
| `true` | `true` | **Utterance final** — speaker stopped, complete stitched utterance |
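One way to consume the two flags in a client is a three-way dispatch. A sketch driven by hand-written events rather than a live socket; the exact display policy (replace vs. append) is an application choice:

```python
def handle_partial(event: dict, state: dict) -> None:
    """Route a transcript.partial event by its finality flags."""
    if event["speech_final"]:
        # Utterance final: the complete stitched utterance replaces pending text.
        state["utterances"].append(event["text"])
        state["pending"] = ""
        state["interim"] = ""
    elif event["is_final"]:
        # Chunk final: this text is locked, but the utterance may continue.
        state["pending"] = event["text"]
        state["interim"] = ""
    else:
        # Interim: display only, may still change.
        state["interim"] = event["text"]

state = {"utterances": [], "pending": "", "interim": ""}
handle_partial({"text": "the balance", "is_final": False, "speech_final": False}, state)
handle_partial({"text": "the balance is", "is_final": True, "speech_final": False}, state)
handle_partial({"text": "the balance is $100.", "is_final": True, "speech_final": True}, state)
print(state["utterances"])  # the finished utterance
```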

### Client Messages

* **Binary frames** — raw audio in the specified encoding (streamed in real-time-paced chunks, e.g. 100 ms)
* **`{"type": "audio.done"}`** — signal end of audio, triggers `transcript.done`

After `transcript.done`, the server resets and is ready for a new turn without reconnecting. When `multichannel=true`, wait for one `transcript.done` per channel before starting a new turn.

### Multichannel Streaming

When `multichannel=true` and `channels` ≥ 2, the server transcribes each audio channel independently. Send interleaved multichannel PCM (e.g. L,R,L,R,… for stereo) as binary frames, and the server de-interleaves and processes each channel in parallel.

**How it works:**

* `transcript.created` is sent once (session-level — no `channel_index`).
* `transcript.partial` events include a `channel_index` field (0-based) identifying the source channel. Events from different channels arrive interleaved.
* `transcript.done` is sent **once per channel** after `audio.done`, each with its own `channel_index`.
* Chunk sizes should account for all channels — e.g. for stereo PCM16 at 16 kHz, 100 ms = 6,400 bytes (3,200 per channel × 2 channels).
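The chunk arithmetic above can be sanity-checked by interleaving two synthetic mono channels into one stereo PCM16 frame:

```python
import struct

SAMPLE_RATE = 16000
CHUNK_MS = 100
samples_per_channel = SAMPLE_RATE * CHUNK_MS // 1000  # 1600 samples

# Two synthetic mono channels (silence stands in for real capture buffers)
left = [0] * samples_per_channel
right = [0] * samples_per_channel

# Interleave L,R,L,R,... then pack as signed 16-bit little-endian
interleaved = [s for pair in zip(left, right) for s in pair]
frame = struct.pack(f"<{len(interleaved)}h", *interleaved)

print(len(frame))  # 6400 bytes: 3200 per channel x 2 channels
```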

**Example URL:**

```
wss://api.x.ai/v1/stt?sample_rate=16000&encoding=pcm&multichannel=true&channels=2&interim_results=true
```

**Typical use case:** Call center recordings with agent on channel 0 and customer on channel 1, enabling per-speaker transcription without requiring speaker diarization.

### Full Example

```python customLanguage="pythonWithoutSDK"
import asyncio
import json
import os

import websockets

API_KEY = os.environ["XAI_API_KEY"]
WS_URL = "wss://api.x.ai/v1/stt?sample_rate=16000&encoding=pcm&interim_results=true&language=en"

async def transcribe_stream(audio_file: str):
    headers = {"Authorization": f"Bearer {API_KEY}"}

    async with websockets.connect(WS_URL, additional_headers=headers) as ws:
        # Wait for server ready signal
        msg = json.loads(await ws.recv())
        assert msg["type"] == "transcript.created"
        print("Server ready")

        # Read raw PCM from a WAV file (skip 44-byte header)
        with open(audio_file, "rb") as f:
            f.read(44)  # Skip WAV header
            chunk_size = 16000 * 2 // 10  # 100ms of PCM16 at 16kHz

            while chunk := f.read(chunk_size):
                await ws.send(chunk)  # Send raw binary — no base64
                await asyncio.sleep(0.1)

        # Signal end of audio
        await ws.send(json.dumps({"type": "audio.done"}))

        # Collect events until transcript.done
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "transcript.partial":
                prefix = "FINAL" if event["is_final"] else "partial"
                print(f"[{prefix}] {event['text']}")
            elif event["type"] == "transcript.done":
                print(f"\nFull transcript: {event['text']}")
                print(f"Duration: {event['duration']}s")
                break

asyncio.run(transcribe_stream("audio.wav"))
```

```javascript customLanguage="javascriptWithoutSDK"
import fs from "fs";
import WebSocket from "ws";

const apiKey = process.env.XAI_API_KEY;
const url = "wss://api.x.ai/v1/stt?sample_rate=16000&encoding=pcm&interim_results=true&language=en";

const ws = new WebSocket(url, { headers: { Authorization: `Bearer ${apiKey}` } });

ws.on("open", () => console.log("Connected"));

ws.on("message", (data) => {
  const event = JSON.parse(data);
  switch (event.type) {
    case "transcript.created": {
      console.log("Server ready — streaming audio...");
      // Read WAV file, skip 44-byte header, send 100ms chunks
      const audio = fs.readFileSync("audio.wav").subarray(44);
      const chunkSize = 3200; // 100ms at 16kHz, 16-bit
      let offset = 0;
      const interval = setInterval(() => {
        if (offset >= audio.length) {
          clearInterval(interval);
          ws.send(JSON.stringify({ type: "audio.done" }));
          return;
        }
        ws.send(audio.subarray(offset, offset + chunkSize));
        offset += chunkSize;
      }, 100);
      break;
    }
    case "transcript.partial": {
      const prefix = event.is_final ? "FINAL" : "partial";
      console.log(`[${prefix}] ${event.text}`);
      break;
    }
    case "transcript.done":
      console.log(`\nFull transcript: ${event.text}`);
      console.log(`Duration: ${event.duration}s`);
      ws.close();
      break;
  }
});
```

### Use Cases

* **Live captions** — Real-time subtitles for video calls, meetings, and live streams
* **Voice assistants** — Transcribe user speech for natural language understanding pipelines
* **Call centers** — Real-time agent assistance with multichannel per-speaker transcription
* **Accessibility** — Live transcription for hearing-impaired users
* **Voice commands** — Low-latency speech-to-action for hands-free interfaces

### Tips for Streaming STT

* **Use 16 kHz sample rate with PCM encoding** (`sample_rate=16000&encoding=pcm`) — this is the model's native rate and avoids resampling on the server
* **Enable `interim_results`** for responsive UX — show transcription as the user speaks
* **Use `language=en`** to enable text formatting — numbers and currencies are written in their standard form
* **Send 100 ms audio chunks** (3,200 bytes at 16 kHz PCM16) for a good balance of latency and efficiency
* **Wait for `transcript.created`** before sending audio — the server needs to initialize its ASR backend

## Error Handling

| Status | Meaning | Action |
|--------|---------|--------|
| `200` | Success | Transcription in the response body |
| `400` | Bad request | Missing `file`/`url`, unsupported format, missing `sample_rate` for raw audio, `format=true` without `language` |
| `401` | Unauthorized | API key is missing or invalid |
| `413` | Payload too large | File exceeds 500 MB |
| `429` | Rate limited | Back off and retry with exponential delay |
| `502` | Bad gateway | URL download failed (when using `url`) |
| `503` | Service unavailable | Backend not available — retry |
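The retryable statuses (`429`, `503`) suit an exponential-backoff wrapper; `send_request` below is a placeholder callable standing in for whatever HTTP call your client makes, not part of the API:

```python
import random
import time

RETRYABLE = {429, 503}

def post_with_backoff(send_request, max_retries=5, base_delay=1.0):
    """Retry a request on 429/503 with exponential backoff and jitter."""
    for attempt in range(max_retries):
        response = send_request()
        if response.status_code not in RETRYABLE:
            return response
        # Double the delay each attempt; jitter avoids synchronized retries.
        time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
    return response  # last response after exhausting retries
```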

## Related

* [Voice Overview](/developers/model-capabilities/audio/voice) — Overview of all xAI voice capabilities
* [Text to Speech](/developers/model-capabilities/audio/text-to-speech) — Convert text to speech
* [API Reference — Speech to text](/developers/rest-api-reference/inference/voice#speech-to-text---rest) — Full REST endpoint specification
* [API Reference — Streaming](/developers/rest-api-reference/inference/voice#speech-to-text---streaming) — WebSocket streaming specification
