Model Capabilities

Speech to Text

Transcribe audio files into text with a single API call, or stream audio in real time over WebSocket. The API supports 12 audio formats, word-level timestamps, multichannel transcription, and text formatting.

Quick Start

Transcribe an audio file with a single API call:

import os
import requests

with open("audio.mp3", "rb") as audio:
    response = requests.post(
        "https://api.x.ai/v1/stt",
        headers={"Authorization": f"Bearer {os.environ['XAI_API_KEY']}"},
        files={"file": ("audio.mp3", audio, "audio/mpeg")},
        data={"format": "true", "language": "en"},
    )
response.raise_for_status()

result = response.json()
print(result["text"])
print(f"Duration: {result['duration']}s")
for word in result.get("words", []):
    print(f"  {word['start']:.2f}s - {word['end']:.2f}s: {word['text']}")


Supported Languages

The language parameter enables formatting for the following languages. The model transcribes speech in any of these languages regardless of the language parameter — setting it enables formatting of numbers, currencies, and units into their written form.

| Language | Code | Language | Code |
|---|---|---|---|
| Arabic | ar | Macedonian | mk |
| Czech | cs | Malay | ms |
| Danish | da | Persian | fa |
| Dutch | nl | Polish | pl |
| English | en | Portuguese | pt |
| Filipino | fil | Romanian | ro |
| French | fr | Russian | ru |
| German | de | Spanish | es |
| Hindi | hi | Swedish | sv |
| Indonesian | id | Thai | th |
| Italian | it | Turkish | tr |
| Japanese | ja | Vietnamese | vi |
| Korean | ko | | |

Request Body

The request uses multipart/form-data. Either file or url must be provided.

| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| file | file | | ✓† | Audio file to transcribe. Max 500 MB. See Supported Formats. |
| url | string | | ✓† | URL of an audio file to download and transcribe (server-side). |
| audio_format | string | | | Format hint for raw/headerless audio: pcm, mulaw, alaw. Container formats are auto-detected — do not set this field for MP3, WAV, etc. |
| sample_rate | integer | | | Sample rate in Hz. Required for raw audio (pcm, mulaw, alaw). Supported: 8000, 16000, 22050, 24000, 44100, 48000. |
| language | string | | | Language code (e.g. en, fr, de). Used with format=true to enable text formatting. See Supported Languages. |
| format | boolean | false | | When true, enables Inverse Text Normalization — converts spoken numbers/currency to written form (e.g. "one hundred dollars" → "$100"). Requires language. |
| multichannel | boolean | false | | When true, transcribes each audio channel independently. Results are returned in the channels array. |
| channels | integer | | | Number of audio channels (2–8). Required for multichannel raw audio. Auto-detected for container formats. |
| diarize | boolean | false | | When true, enables speaker diarization. Each word in the response includes a speaker field (integer) identifying the detected speaker. |

† Either file or url must be provided.
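As a sketch of how these fields combine, the request below is built (but not sent) so its multipart shape can be inspected; the file bytes are placeholders, not real audio:

```python
import requests

# Build, but do not send, a diarization request to inspect its shape.
# The placeholder bytes stand in for real WAV audio.
req = requests.Request(
    "POST",
    "https://api.x.ai/v1/stt",
    headers={"Authorization": "Bearer YOUR_XAI_API_KEY"},
    files={"file": ("call.wav", b"RIFF....WAVE", "audio/wav")},
    data={"diarize": "true"},
).prepare()

print(req.url)                      # https://api.x.ai/v1/stt
print(req.headers["Content-Type"])  # multipart/form-data; boundary=...
```

Calling `requests.post` with the same `files` and `data` arguments sends this request for real.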

Example with text formatting

Bash

curl -X POST https://api.x.ai/v1/stt \
  -H "Authorization: Bearer $XAI_API_KEY" \
  -F format=true \
  -F language=en \
  -F file=@meeting.mp3

Note: The file parameter must always be the last parameter in the curl command


Response

The response includes the full transcript, audio duration, and word-level timestamps.

JSON

{
  "text": "The balance is $167,983.15.",
  "language": "English",
  "duration": 3.45,
  "words": [
    { "text": "The", "start": 0.24, "end": 0.48 },
    { "text": "balance", "start": 0.48, "end": 0.96 },
    { "text": "is", "start": 0.96, "end": 1.12 },
    { "text": "$167,983.15.", "start": 1.12, "end": 3.20 }
  ]
}

| Field | Type | Description |
|---|---|---|
| text | string | Full transcript text. |
| language | string | Detected language name (e.g. "English", "French"). |
| duration | number | Audio duration in seconds (2 d.p.). |
| words | array | Word-level segments with text, start, end, and speaker (integer, only when diarize=true). |
| channels | array | Per-channel transcripts (only when multichannel=true). Each entry has index, text, and words. |
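A consumer can rely on two invariants of this shape: the word segments concatenate to the transcript, and no timestamp exceeds the audio duration. A minimal sketch, using a dict mirroring the example response above:

```python
# A response shaped like the JSON example above.
result = {
    "text": "The balance is $167,983.15.",
    "language": "English",
    "duration": 3.45,
    "words": [
        {"text": "The", "start": 0.24, "end": 0.48},
        {"text": "balance", "start": 0.48, "end": 0.96},
        {"text": "is", "start": 0.96, "end": 1.12},
        {"text": "$167,983.15.", "start": 1.12, "end": 3.20},
    ],
}

# The transcript is the word segments joined in order.
transcript = " ".join(w["text"] for w in result["words"])
print(transcript == result["text"])  # True

# Word timestamps never exceed the audio duration.
print(max(w["end"] for w in result["words"]) <= result["duration"])  # True
```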

Supported Audio Formats

Container formats (auto-detected)

| Format | Extension | Description |
|---|---|---|
| WAV | .wav | Waveform Audio — lossless, best quality input |
| MP3 | .mp3 | MPEG Audio Layer 3 — widely supported |
| OGG | .ogg | Ogg container — open format |
| Opus | .opus | Opus codec — low-latency, high quality |
| FLAC | .flac | Free Lossless Audio Codec — lossless compression |
| AAC | .aac | Advanced Audio Coding |
| MP4 | .mp4 | MPEG-4 container |
| M4A | .m4a | MPEG-4 Audio — Apple ecosystem standard |
| MKV | .mkv | Matroska container — supports MP3, AAC, and FLAC audio codecs |

Raw formats (require audio_format and sample_rate)

| Format | audio_format value | Description |
|---|---|---|
| PCM | pcm | Signed 16-bit little-endian (2 bytes/sample) |
| µ-law | mulaw | G.711 µ-law (1 byte/sample) |
| A-law | alaw | G.711 A-law (1 byte/sample) |
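The byte arithmetic for raw formats follows directly from this table; for example, one second of audio at 16 kHz:

```python
import struct

SAMPLE_RATE = 16000  # must be passed as sample_rate for raw audio

# One second of silence as PCM: signed 16-bit little-endian, 2 bytes/sample.
pcm = struct.pack(f"<{SAMPLE_RATE}h", *([0] * SAMPLE_RATE))
print(len(pcm))  # 32000 bytes (16000 samples x 2 bytes)

# The same second in mulaw or alaw is 1 byte/sample:
print(SAMPLE_RATE * 1)  # 16000 bytes
```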

Limits

  • Max file size: 500 MB
  • Channels: Mono, stereo, or up to 8 channels (with multichannel=true)
  • Sample rates: 8000, 16000, 22050, 24000, 44100, 48000 Hz

Streaming Speech-to-Text (WebSocket)

For real-time transcription, use the WebSocket API at wss://api.x.ai/v1/stt. The client streams raw audio as binary WebSocket frames and receives JSON transcript events as the audio is processed.

Endpoint: wss://api.x.ai/v1/stt

Configuration is done via URL query parameters — no setup message required. Audio is sent as raw binary frames (no base64 encoding).

Never expose your API key in client-side code. Always proxy WebSocket connections through your backend.


Query Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| sample_rate | integer | 16000 | Audio sample rate in Hz. |
| encoding | string | pcm | Audio encoding: pcm, mulaw, or alaw. |
| interim_results | boolean | false | When true, emits partial transcripts (is_final=false) every ~500 ms. |
| endpointing | integer | 10 | Silence duration (ms) before an utterance-final event. Range: 0–5000. 0 = fire on any VAD silence boundary. |
| language | string | | Language code for text formatting. See Supported Languages. |
| diarize | boolean | | When true, enables speaker diarization. Words include a speaker field identifying the detected speaker. |
| multichannel | boolean | false | Per-channel transcription. Requires channels ≥ 2. |
| channels | integer | 1 | Number of interleaved audio channels (max 8). |
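A connection URL carrying these parameters can be assembled with the standard library; the parameter values here are illustrative:

```python
from urllib.parse import urlencode

params = {
    "sample_rate": 16000,
    "encoding": "pcm",
    "interim_results": "true",
    "endpointing": 300,  # wait 300 ms of silence before finalizing an utterance
    "language": "en",
}
url = "wss://api.x.ai/v1/stt?" + urlencode(params)
print(url)
# wss://api.x.ai/v1/stt?sample_rate=16000&encoding=pcm&interim_results=true&endpointing=300&language=en
```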

Server Events

| Event | Description |
|---|---|
| transcript.created | Server ready — wait for this before sending audio. |
| transcript.partial | Transcript result with text, words, is_final, speech_final, start, duration. Includes channel_index when multichannel=true. |
| transcript.done | End of turn after audio.done. Contains the remaining transcript, or empty text/words if speech_final already covered all audio. duration is always present. Includes channel_index when multichannel=true — one event is sent per channel. |
| error | Error with a message field. The connection stays open. |

The transcript.partial event uses is_final and speech_final to convey three states:

| is_final | speech_final | Meaning |
|---|---|---|
| false | false | Interim — text may change (only when interim_results=true) |
| true | false | Chunk final — text locked, ~3 s of speech finalized |
| true | true | Utterance final — speaker stopped, complete stitched utterance |
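A client can map the two flags to these three states with a small helper (a sketch; the event dicts are shaped as described above):

```python
def transcript_state(event: dict) -> str:
    """Classify a transcript.partial event by its finality flags."""
    if not event["is_final"]:
        return "interim"          # text may still change
    if not event["speech_final"]:
        return "chunk_final"      # text locked, utterance continues
    return "utterance_final"      # speaker stopped, utterance complete

print(transcript_state({"is_final": False, "speech_final": False}))  # interim
print(transcript_state({"is_final": True, "speech_final": False}))   # chunk_final
print(transcript_state({"is_final": True, "speech_final": True}))    # utterance_final
```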

Client Messages

  • Binary frames — raw audio in the specified encoding (streamed in real-time-paced chunks, e.g. 100 ms)
  • {"type": "audio.done"} — signal end of audio, triggers transcript.done

After transcript.done, the server resets and is ready for a new turn without reconnecting. When multichannel=true, wait for one transcript.done per channel before starting a new turn.


Multichannel Streaming

When multichannel=true and channels ≥ 2, the server transcribes each audio channel independently. Send interleaved multichannel PCM (e.g. L,R,L,R,… for stereo) as binary frames, and the server de-interleaves and processes each channel in parallel.

How it works:

  • transcript.created is sent once (session-level — no channel_index).
  • transcript.partial events include a channel_index field (0-based) identifying the source channel. Events from different channels arrive interleaved.
  • transcript.done is sent once per channel after audio.done, each with its own channel_index.
  • Chunk sizes should account for all channels — e.g. for stereo PCM16 at 16 kHz, 100 ms = 6,400 bytes (3,200 per channel × 2 channels).
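The chunk-size arithmetic in the last point generalizes to any encoding and channel count; a small helper, assuming the byte widths from the raw-formats table:

```python
# Bytes per sample for each raw encoding (from the raw-formats table).
BYTES_PER_SAMPLE = {"pcm": 2, "mulaw": 1, "alaw": 1}

def chunk_bytes(ms: int, sample_rate: int = 16000,
                encoding: str = "pcm", channels: int = 1) -> int:
    """Bytes in `ms` milliseconds of interleaved audio."""
    return sample_rate * ms // 1000 * BYTES_PER_SAMPLE[encoding] * channels

print(chunk_bytes(100))              # 3200 — mono PCM16 at 16 kHz
print(chunk_bytes(100, channels=2))  # 6400 — stereo, matching the note above
```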

Example URL:

Text

wss://api.x.ai/v1/stt?sample_rate=16000&encoding=pcm&multichannel=true&channels=2&interim_results=true

Typical use case: Call center recordings with agent on channel 0 and customer on channel 1, enabling per-speaker transcription without requiring speaker diarization.


Full Example

import asyncio
import json
import os

import websockets

API_KEY = os.environ["XAI_API_KEY"]
WS_URL = "wss://api.x.ai/v1/stt?sample_rate=16000&encoding=pcm&interim_results=true&language=en"

async def transcribe_stream(audio_file: str):
    headers = {"Authorization": f"Bearer {API_KEY}"}

    async with websockets.connect(WS_URL, additional_headers=headers) as ws:
        # Wait for server ready signal
        msg = json.loads(await ws.recv())
        assert msg["type"] == "transcript.created"
        print("Server ready")

        # Read raw PCM from a WAV file (skip 44-byte header)
        with open(audio_file, "rb") as f:
            f.read(44)  # Skip WAV header
            chunk_size = 16000 * 2 // 10  # 100ms of PCM16 at 16kHz

            while chunk := f.read(chunk_size):
                await ws.send(chunk)  # Send raw binary — no base64
                await asyncio.sleep(0.1)

        # Signal end of audio
        await ws.send(json.dumps({"type": "audio.done"}))

        # Collect events until transcript.done
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "transcript.partial":
                prefix = "FINAL" if event["is_final"] else "partial"
                print(f"[{prefix}] {event['text']}")
            elif event["type"] == "transcript.done":
                print(f"\nFull transcript: {event['text']}")
                print(f"Duration: {event['duration']}s")
                break

asyncio.run(transcribe_stream("audio.wav"))

Use Cases

  • Live captions — Real-time subtitles for video calls, meetings, and live streams
  • Voice assistants — Transcribe user speech for natural language understanding pipelines
  • Call centers — Real-time agent assistance with multichannel per-speaker transcription
  • Accessibility — Live transcription for hearing-impaired users
  • Voice commands — Low-latency speech-to-action for hands-free interfaces

Tips for Streaming STT

  • Use 16 kHz sample rate with PCM encoding (sample_rate=16000&encoding=pcm) — this is the model's native rate and avoids resampling on the server
  • Enable interim_results for responsive UX — show transcription as the user speaks
  • Use language=en to enable text formatting — numbers and currencies are written in their standard form
  • Send 100 ms audio chunks (3,200 bytes at 16 kHz PCM16) for a good balance of latency and efficiency
  • Wait for transcript.created before sending audio — the server needs to initialize its ASR backend

Error Handling

| Status | Meaning | Action |
|---|---|---|
| 200 | Success | Transcription is in the response body |
| 400 | Bad request | Missing file/url, unsupported format, missing sample_rate for raw audio, or format=true without language |
| 401 | Unauthorized | API key is missing or invalid |
| 413 | Payload too large | File exceeds 500 MB |
| 429 | Rate limited | Back off and retry with exponential delay |
| 502 | Bad gateway | URL download failed (when using url) |
| 503 | Service unavailable | Backend not available — retry |
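For 429 and 503, a retry loop with exponential backoff might look like this sketch; `send` is a stand-in for the actual POST to /v1/stt:

```python
import time

def with_backoff(send, retry_on=(429, 503), max_retries=5, base=1.0):
    """Call send() until it returns a status not in retry_on,
    sleeping base, 2*base, 4*base, ... seconds between attempts."""
    for attempt in range(max_retries):
        status = send()
        if status not in retry_on:
            return status
        time.sleep(base * 2 ** attempt)
    raise RuntimeError("retries exhausted")
```

In practice, `send` would wrap a requests.post call to https://api.x.ai/v1/stt and return response.status_code, with the caller handling any non-retryable status.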