Model Capabilities

Voice Agent API


Build real-time voice applications powered by Grok. Stream audio and text bidirectionally via WebSocket for voice assistants, phone agents, and interactive voice systems.

Quick Start

Connect to the Voice Agent API and start a conversation:

import asyncio
import json
import os
import websockets

async def voice_agent():
    async with websockets.connect(
        "wss://api.x.ai/v1/realtime",
        additional_headers={"Authorization": f"Bearer {os.environ['XAI_API_KEY']}"}
    ) as ws:
        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "voice": "Eve",
                "instructions": "You are a helpful assistant.",
                "turn_detection": {"type": "server_vad"}
            }
        }))
        
        # Send a text message
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {"type": "message", "role": "user", 
                     "content": [{"type": "input_text", "text": "Hello!"}]}
        }))
        await ws.send(json.dumps({"type": "response.create"}))
        
        # Receive audio/text responses
        async for msg in ws:
            event = json.loads(msg)
            print(f"Event: {event['type']}")

asyncio.run(voice_agent())


Authentication

Authenticate your WebSocket connection with either method:

  • Ephemeral Tokens (recommended) — Short-lived tokens for client-side apps (browsers, mobile). Keeps your API key off the client.
  • API Key — Pass your xAI API key directly in the Authorization header. Server-side only. See Connect via WebSocket for examples.
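For client-side connections, the ephemeral token travels as a WebSocket subprotocol rather than a header (the JavaScript example later in this document uses the same `xai-client-secret.<token>` convention). A minimal Python sketch, assuming you have already minted a token server-side:

```python
def ephemeral_subprotocols(token: str) -> list:
    """Build the WebSocket subprotocol list that carries an ephemeral token."""
    return [f"xai-client-secret.{token}"]

async def connect_with_ephemeral_token(token: str):
    # Import here so the helper above stays dependency-free.
    import websockets  # same client library as the Quick Start example
    # The token travels as a subprotocol, so the raw API key never
    # reaches the client device.
    return await websockets.connect(
        "wss://api.x.ai/v1/realtime",
        subprotocols=ephemeral_subprotocols(token),
    )
```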


Session Parameters

| Parameter | Type | Description |
|---|---|---|
| instructions | string | System prompt |
| voice | string | Voice selection: Eve, Ara, Rex, Sal, Leo (see Voice Options) |
| tools | array | Tools available to the voice agent. Supports file_search, web_search, x_search, mcp, and function types. See Using Tools. |
| turn_detection.type | string or null | "server_vad" for automatic detection, null for manual text turns |
| turn_detection.threshold | number, optional | VAD activation threshold (0.1–0.9). Higher values require louder audio to trigger. Default: 0.85. |
| turn_detection.silence_duration_ms | number, optional | How long the user must be silent (in ms) before the server ends the turn (0–10000). Higher values let users pause longer without being cut off. |
| turn_detection.prefix_padding_ms | number, optional | Amount of audio (in ms) included before the detected start of speech (0–10000). Helps capture the beginnings of words that might otherwise be clipped by the VAD. Default: 333. |
| audio.input.format.type | string | Input format: "audio/pcm", "audio/pcmu", or "audio/pcma" |
| audio.input.format.rate | number | Input sample rate (PCM only): 8000, 16000, 22050, 24000, 32000, 44100, 48000 |
| audio.output.format.type | string | Output format: "audio/pcm", "audio/pcmu", or "audio/pcma" |
| audio.output.format.rate | number | Output sample rate (PCM only): 8000, 16000, 22050, 24000, 32000, 44100, 48000 |
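The turn-detection parameters in the table above can be tuned together. A sketch with illustrative values (chosen here for a noisy environment with patient speakers, not recommended defaults):

```python
# Hypothetical tuning: raise the VAD threshold for a noisy room and
# allow longer pauses before the server ends the user's turn.
session_config = {
    "type": "session.update",
    "session": {
        "instructions": "You are a helpful assistant.",
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.9,             # louder audio required to trigger (default 0.85)
            "silence_duration_ms": 1200,  # wait 1.2 s of silence before ending the turn
            "prefix_padding_ms": 333,     # keep the default lead-in audio
        },
    },
}
```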

Available Voices

| Voice | Type | Tone | Description |
|---|---|---|---|
| Eve | Female | Energetic, upbeat | Default voice, engaging and enthusiastic |
| Ara | Female | Warm, friendly | Balanced and conversational |
| Rex | Male | Confident, clear | Professional and articulate, ideal for business applications |
| Sal | Neutral | Smooth, balanced | Versatile voice suitable for various contexts |
| Leo | Male | Authoritative, strong | Decisive and commanding, suitable for instructional content |

Selecting a Voice

Specify the voice in your session configuration using the voice parameter:

# Configure session with a specific voice
session_config = {
    "type": "session.update",
    "session": {
        "voice": "Eve",  # Choose from: Eve, Ara, Rex, Sal, Leo
        "instructions": "You are a helpful assistant.",
        # Audio format settings (these are the defaults if not specified)
        "audio": {
            "input": {"format": {"type": "audio/pcm", "rate": 24000}},
            "output": {"format": {"type": "audio/pcm", "rate": 24000}}
        }
    }
}

await ws.send(json.dumps(session_config))

Audio Formats

| Format | Encoding | Container Types | Sample Rate |
|---|---|---|---|
| audio/pcm | Linear16, little-endian | Raw, WAV, AIFF | Configurable (see below) |
| audio/pcmu | G.711 μ-law (mu-law) | Raw | 8000 Hz |
| audio/pcma | G.711 A-law | Raw | 8000 Hz |

Supported Sample Rates

When using audio/pcm format, you can configure the sample rate to one of the following supported values:

| Sample Rate | Quality | Description |
|---|---|---|
| 8000 Hz | Telephone | Narrowband, suitable for voice calls |
| 16000 Hz | Wideband | Good for speech recognition |
| 22050 Hz | Standard | Balanced quality and bandwidth |
| 24000 Hz | High (default) | Recommended for most use cases |
| 32000 Hz | Very high | Enhanced audio clarity |
| 44100 Hz | CD quality | Standard for music and media |
| 48000 Hz | Professional | Studio-grade audio; common in web browsers |

Configuring Audio Format

You can configure the audio format and sample rate for both input and output in the session configuration:

# Configure audio format with custom sample rate for input and output
session_config = {
    "type": "session.update",
    "session": {
        "audio": {
            "input": {
                "format": {
                    "type": "audio/pcm",  # or "audio/pcmu" or "audio/pcma"
                    "rate": 16000  # Only applicable for audio/pcm
                }
            },
            "output": {
                "format": {
                    "type": "audio/pcm",  # or "audio/pcmu" or "audio/pcma"
                    "rate": 16000  # Only applicable for audio/pcm
                }
            }
        },
        "instructions": "You are a helpful assistant.",
    }
}

await ws.send(json.dumps(session_config))

Receiving and Playing Audio

Decode and play base64 PCM16 audio received from the API. Use the same sample rate as configured:

import base64
import numpy as np

# Configure session with 16kHz sample rate for lower bandwidth (input and output)
session_config = {
    "type": "session.update",
    "session": {
        "instructions": "You are a helpful assistant.",
        "voice": "Eve",
        "turn_detection": {
            "type": "server_vad",
        },
        "audio": {
            "input": {
                "format": {
                    "type": "audio/pcm",
                    "rate": 16000  # 16kHz for lower bandwidth usage
                }
            },
            "output": {
                "format": {
                    "type": "audio/pcm",
                    "rate": 16000  # 16kHz for lower bandwidth usage
                }
            }
        }
    }
}
await ws.send(json.dumps(session_config))

# When processing audio, use the same sample rate
SAMPLE_RATE = 16000

# Convert audio data to PCM16 and base64
def audio_to_base64(audio_data: np.ndarray) -> str:
    """Convert float32 audio array to base64 PCM16 string."""
    # Normalize to [-1, 1] and convert to int16
    audio_int16 = (audio_data * 32767).astype(np.int16)
    # Encode to base64
    audio_bytes = audio_int16.tobytes()
    return base64.b64encode(audio_bytes).decode('utf-8')

# Convert base64 PCM16 to audio data
def base64_to_audio(base64_audio: str) -> np.ndarray:
    """Convert base64 PCM16 string to float32 audio array."""
    # Decode base64
    audio_bytes = base64.b64decode(base64_audio)
    # Convert to int16 array
    audio_int16 = np.frombuffer(audio_bytes, dtype=np.int16)
    # Normalize to [-1, 1]
    return audio_int16.astype(np.float32) / 32768.0
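To stream microphone input, each captured chunk is encoded the same way and sent to the server. A sketch assuming the OpenAI-compatible input_audio_buffer.append client event (this API is documented below as OpenAI Realtime compatible, but verify the event name against the Client Events reference):

```python
import base64
import json

import numpy as np

def chunk_to_append_event(chunk: np.ndarray) -> str:
    """Wrap one float32 mic chunk as an input_audio_buffer.append message."""
    # Clamp to [-1, 1], convert to little-endian PCM16, then base64-encode
    pcm16 = (np.clip(chunk, -1.0, 1.0) * 32767).astype(np.int16)
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16.tobytes()).decode("utf-8"),
    })

# In your capture loop:
#   await ws.send(chunk_to_append_event(mic_chunk))
```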

Using Tools with Grok Voice Agent API

The Grok Voice Agent API supports a range of tools that extend your voice agent's capabilities. Configure them in the tools array of the session.update message.

Available Tool Types

  • Collections Search (file_search) - Search through your uploaded document collections
  • Web Search (web_search) - Search the web for current information
  • X Search (x_search) - Search X (Twitter) for posts and information
  • Remote MCP Tools (mcp) - Connect to external MCP (Model Context Protocol) servers for custom tools
  • Custom Functions - Define your own function tools with JSON schemas

Collections Search

Use the file_search tool to enable your voice agent to search through document collections. You'll need to create a collection first using the Collections API.

COLLECTION_ID = "your-collection-id"  # Replace with your collection ID

session_config = {
    "type": "session.update",
    "session": {
        ...
        "tools": [
            {
                "type": "file_search",
                "vector_store_ids": [COLLECTION_ID],
                "max_num_results": 10,
            },
        ],
    },
}

Web Search and X Search

Configure web search and X search tools to give your voice agent access to current information from the web and X (Twitter).

session_config = {
    "type": "session.update",
    "session": {
        ...
        "tools": [
            {
                "type": "web_search",
            },
            {
                "type": "x_search",
                "allowed_x_handles": ["elonmusk", "xai"],
            },
        ],
    },
}

Remote MCP Tools

Use the mcp tool type to connect your voice agent to external MCP (Model Context Protocol) servers. This lets you extend your voice agent with third-party or custom tools without implementing them as client-side functions — xAI manages the MCP server connection and tool execution on your behalf.

session_config = {
    "type": "session.update",
    "session": {
        ...
        "tools": [
            {
                "type": "mcp",
                "server_url": "https://mcp.example.com/mcp",
                "server_label": "my-tools",
            },
        ],
    },
}

MCP Tool Parameters

| Parameter | Required | Description |
|---|---|---|
| server_url | Yes | The URL of the MCP server. Only streaming HTTP and SSE transports are supported. |
| server_label | Yes | A label identifying the server (used to prefix tool call names). |
| server_description | No | A description of what the server provides. |
| allowed_tools | No | List of specific tool names to allow. If omitted, all tools from the server are available. |
| authorization | No | A token set in the Authorization header on requests to the MCP server. |
| headers | No | Additional headers to include in requests to the MCP server. |

Advanced MCP Configuration

You can restrict which tools are available, provide authentication, and add custom headers:

session_config = {
    "type": "session.update",
    "session": {
        ...
        "tools": [
            {
                "type": "mcp",
                "server_url": "https://mcp.example.com/mcp",
                "server_label": "my-tools",
                "server_description": "Custom business tools for order management",
                "allowed_tools": ["lookup_order", "check_inventory"],
                "authorization": "Bearer your-token-here",
                "headers": {
                    "X-Custom-Header": "value"
                },
            },
        ],
    },
}

Multiple MCP Servers

You can connect to multiple MCP servers simultaneously, each providing different capabilities:

session_config = {
    "type": "session.update",
    "session": {
        ...
        "tools": [
            {
                "type": "mcp",
                "server_url": "https://mcp.deepwiki.com/mcp",
                "server_label": "deepwiki",
            },
            {
                "type": "mcp",
                "server_url": "https://your-tools.example.com/mcp",
                "server_label": "custom-tools",
                "allowed_tools": ["search_database", "format_data"],
            },
        ],
    },
}

MCP tools are server-side tools — xAI handles the connection and execution automatically. Unlike custom function tools, you don't need to handle tool call responses in your client code. For more details on MCP tool configuration, see the Remote MCP Tools guide.

Custom Function Tools

You can define custom function tools with JSON schemas to extend your voice agent's capabilities.

session_config = {
    "type": "session.update",
    "session": {
        ...
        "tools": [
            {
                "type": "function",
                "name": "generate_random_number",
                "description": "Generate a random number between min and max values",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "min": {
                            "type": "number",
                            "description": "Minimum value (inclusive)",
                        },
                        "max": {
                            "type": "number",
                            "description": "Maximum value (inclusive)",
                        },
                    },
                    "required": ["min", "max"],
                },
            },
        ],
    },
}

Combining Multiple Tools

You can combine multiple tool types in a single session configuration, including server-side tools (web search, X search, collections, MCP) and client-side function tools:

session_config = {
    "type": "session.update",
    "session": {
        ...
        "tools": [
            {
                "type": "file_search",
                "vector_store_ids": ["your-collection-id"],
                "max_num_results": 10,
            },
            {
                "type": "web_search",
            },
            {
                "type": "x_search",
            },
            {
                "type": "mcp",
                "server_url": "https://mcp.example.com/mcp",
                "server_label": "my-tools",
            },
            {
                "type": "function",
                "name": "generate_random_number",
                "description": "Generate a random number",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "min": {"type": "number"},
                        "max": {"type": "number"},
                    },
                    "required": ["min", "max"],
                },
            },
        ],
    },
}

Server-side tools (web search, X search, collections, and MCP) are executed automatically by xAI — you don't need to handle their responses. Only custom function tools require client-side handling. For more details, see Collections, Web Search, X Search, and Remote MCP Tools.

Handling Function Call Responses

When you define custom function tools, the voice agent will call these functions during conversation. You need to handle these function calls, execute them, and return the results to continue the conversation.

Function Call Flow

  1. Agent decides to call a function → sends response.function_call_arguments.done event
  2. Your code executes the function → processes the arguments and generates a result
  3. Send result back to agent → sends conversation.item.create with the function output
  4. Request continuation → sends response.create to let the agent continue

Complete Example

import json
import websockets

# Define your function implementations
def get_weather(location: str, units: str = "celsius"):
    """Get current weather for a location"""
    # In production, call a real weather API
    return {
        "location": location,
        "temperature": 22,
        "units": units,
        "condition": "Sunny",
        "humidity": 45
    }

def book_appointment(date: str, time: str, service: str):
    """Book an appointment"""
    # In production, interact with your booking system
    import random
    confirmation = f"CONF{random.randint(1000, 9999)}"
    return {
        "status": "confirmed",
        "confirmation_code": confirmation,
        "date": date,
        "time": time,
        "service": service
    }

# Map function names to implementations
FUNCTION_HANDLERS = {
    "get_weather": get_weather,
    "book_appointment": book_appointment
}

async def handle_function_call(ws, event):
    """Handle function call from the voice agent"""
    function_name = event["name"]
    call_id = event["call_id"]
    arguments = json.loads(event["arguments"])

    print(f"Function called: {function_name} with args: {arguments}")

    # Execute the function
    if function_name in FUNCTION_HANDLERS:
        result = FUNCTION_HANDLERS[function_name](**arguments)

        # Send result back to agent
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": call_id,
                "output": json.dumps(result)
            }
        }))

        # Request agent to continue with the result
        await ws.send(json.dumps({
            "type": "response.create"
        }))
    else:
        print(f"Unknown function: {function_name}")

# In your WebSocket message handler
async def on_message(ws, message):
    event = json.loads(message)

    # Listen for function calls
    if event["type"] == "response.function_call_arguments.done":
        await handle_function_call(ws, event)
    elif event["type"] == "response.output_audio.delta":
        # Handle audio response
        pass

Function Call Events

| Event | Direction | Description |
|---|---|---|
| response.function_call_arguments.done | Server → Client | Function call triggered with complete arguments |
| conversation.item.create (function_call_output) | Client → Server | Send the function execution result back |
| response.create | Client → Server | Request the agent to continue processing |

Parallel Tool Calling

When the model determines that multiple function calls are needed to fulfill a request, it emits multiple response.function_call_arguments.done events before any audio response. In this case, you must resolve all function calls and send their results back before sending response.create.

Expected behavior:

  1. Receive multiple response.function_call_arguments.done events (one per function call)
  2. Execute all functions (can be done in parallel for performance)
  3. Send a conversation.item.create with function_call_output for each function call
  4. Only after all function outputs have been sent, send a single response.create to continue

Important: Do not send response.create until all function call outputs have been submitted. Sending response.create prematurely will cause the model to respond without the complete context from all tool results.
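The steps above can be sketched as a handler that resolves every pending call before continuing. This assumes the handlers mapping and received events come from your application code, as in the Complete Example earlier:

```python
import asyncio
import json

async def handle_parallel_calls(ws, call_events, handlers):
    """Resolve all pending function calls, then send one response.create.

    call_events: the response.function_call_arguments.done events collected
    for this turn; handlers: maps function name -> callable.
    """
    async def run(event):
        args = json.loads(event["arguments"])
        result = handlers[event["name"]](**args)
        # Send one function_call_output per call_id
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": json.dumps(result),
            },
        }))

    # Execute all calls concurrently; only then request the continuation
    await asyncio.gather(*(run(e) for e in call_events))
    await ws.send(json.dumps({"type": "response.create"}))
```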

Best Practices

This section outlines key recommendations for building low-latency, reliable, and natural-feeling voice experiences using the xAI Voice Agent API.

Minimize Perceived Latency – Parallel Initialization

Start the WebSocket connection and microphone input streaming in parallel.

  • Initiate the WebSocket connection (including authentication via ephemeral token or API key) as early as possible — ideally when the voice interface loads or the user opens the mic-enabled screen.
  • Simultaneously begin capturing microphone audio (using getUserMedia in browsers or equivalent APIs on mobile/native platforms).
  • Do not wait for the WebSocket open event before starting to collect microphone samples.

Audio Buffering Example

JavaScript

// 1. Immediately request mic access and start capturing
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

const audioContext = new AudioContext({ sampleRate: 24000 });

const source = audioContext.createMediaStreamSource(stream);
const processor = audioContext.createScriptProcessor(4096, 1, 1); // or AudioWorklet for better perf

source.connect(processor);
processor.connect(audioContext.destination); // optional

// Buffer incoming PCM data immediately
let earlyAudioBuffer = []; // Float32Array[] or Int16Array[]

processor.onaudioprocess = (e) => {
  const input = e.inputBuffer.getChannelData(0);
  earlyAudioBuffer.push(new Float32Array(input)); // or convert to PCM16
};

// 2. In parallel – connect WebSocket (may take time)
const ws = new WebSocket("wss://api.x.ai/v1/realtime", [
  `xai-client-secret.${token}`,
]);

ws.onopen = () => {
  // Send session.update configuration
  ws.send(JSON.stringify({ type: "session.update", session: { ... } }));

  // Flush any buffered audio now that we're connected
  if (earlyAudioBuffer.length > 0) {
    flushBufferedAudioToWS(earlyAudioBuffer);
    earlyAudioBuffer = [];
  }
};

Tips for Production

  • Convert to 24 kHz PCM16 little-endian before buffering or flushing.
  • Flush in reasonably sized messages (about 100 ms of audio each) for smooth transmission.
  • On reconnection, resume buffering immediately.
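On the server side of your app (or anywhere Python handles audio), the flush step can be sketched as follows. This splits buffered float32 capture into ~100 ms base64 PCM16 payloads; the chunk size is derived from the 24 kHz default rate and is an illustrative choice:

```python
import base64

import numpy as np

SAMPLE_RATE = 24000
CHUNK_SAMPLES = SAMPLE_RATE // 10  # ~100 ms of audio per message

def flush_chunks(buffered: list) -> list:
    """Split buffered float32 arrays into ~100 ms base64 PCM16 payloads."""
    audio = np.concatenate(buffered)
    payloads = []
    for start in range(0, len(audio), CHUNK_SAMPLES):
        # Clamp, convert to little-endian PCM16, base64-encode
        pcm16 = (np.clip(audio[start:start + CHUNK_SAMPLES], -1.0, 1.0)
                 * 32767).astype(np.int16)
        payloads.append(base64.b64encode(pcm16.tobytes()).decode("utf-8"))
    return payloads
```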

Avoid Audio Overlap During Tool Calls

When the model invokes a tool during a voice response, the server delivers all audio deltas first, then the function call events alongside response.done. If your client immediately sends conversation.item.create (with the function result) followed by response.create, the server starts generating the next response right away — even if the client is still playing audio from the previous turn. This causes overlapping audio.

Recommended sequence:

  1. Receive response.function_call_arguments.done → execute your tool
  2. Send conversation.item.create with the function_call_output
  3. Wait until audio playback of the current turn is complete (or nearly complete)
  4. Then send response.create

While waiting for playback to finish, show a visual "thinking" indicator (e.g., animated dots) so the user knows the agent is processing. This creates a natural pause between the model's spoken response and the follow-up after the tool result.

JavaScript

ws.on("message", async (message) => {
  const event = JSON.parse(message);

  if (event.type === "response.function_call_arguments.done") {
    // 1. Execute the tool
    const result = await executeFunction(event.name, JSON.parse(event.arguments));

    // 2. Send the function result immediately
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: event.call_id,
        output: JSON.stringify(result),
      },
    }));

    // 3. Show a "thinking" indicator in the UI
    showThinkingIndicator();

    // 4. Wait for current audio playback to finish
    await waitForPlaybackComplete();

    // 5. Now request the next response
    ws.send(JSON.stringify({ type: "response.create" }));
    hideThinkingIndicator();
  }
});

Additional High-Impact Recommendations

  • Prefer ephemeral tokens for client-side security.
  • Enable server_vad for automatic, natural barge-in.
  • Match input/output format (24 kHz PCM) to avoid resampling.
  • Stream output audio deltas (response.output_audio.delta) to the speaker instantly — do not wait for the full response.
  • Implement graceful reconnection while continuing to buffer new audio.
  • Monitor WebSocket health and use exponential backoff if needed.
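The reconnection advice above can be sketched as a jittered exponential backoff loop. The connect factory, 1 s initial delay, and 30 s cap are illustrative choices, not documented requirements:

```python
import asyncio
import random

def next_delay(delay: float, max_delay: float = 30.0) -> float:
    """Exponential backoff: double the delay, capped at max_delay."""
    return min(delay * 2, max_delay)

async def connect_with_backoff(connect, max_delay: float = 30.0):
    """Retry an async connect factory until it succeeds.

    connect: a zero-argument coroutine factory, e.g.
    lambda: websockets.connect(url, additional_headers=headers)
    """
    delay = 1.0
    while True:
        try:
            return await connect()
        except OSError:
            # Add jitter so many clients don't reconnect in lockstep
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
            delay = next_delay(delay, max_delay)
```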

OpenAI Realtime API Compatibility

The Grok Voice Agent API is compatible with the OpenAI Realtime API. Most OpenAI client libraries and SDKs work with the xAI endpoint by changing the base URL to wss://api.x.ai/v1/realtime. This section documents event naming differences and unsupported events.

Event Naming Differences

The xAI API uses the OpenAI beta event names for text output. These events are functionally identical:

| OpenAI GA Event | xAI Event |
|---|---|
| response.output_text.delta | response.text.delta |
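If you are porting a client written against the GA names, a small normalization shim keeps existing handlers working. A minimal sketch (the mapping covers only the difference documented above):

```python
# Map the beta name this API emits to the OpenAI GA name, so handlers
# written against the GA schema keep working unchanged.
XAI_TO_GA = {"response.text.delta": "response.output_text.delta"}

def normalize_event_type(event_type: str) -> str:
    """Translate a received event type to its GA equivalent, if any."""
    return XAI_TO_GA.get(event_type, event_type)
```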

Unsupported Client Events

| OpenAI Event | Notes |
|---|---|
| conversation.item.retrieve | Not supported. |
| conversation.item.truncate | Not supported. |
| output_audio_buffer.clear | WebRTC/SIP only. |

Unsupported Server Events

| OpenAI Event | Notes |
|---|---|
| conversation.item.done | Not emitted. |
| conversation.item.input_audio_transcription.delta | Use the completed event instead (emitted for both partial and final transcripts). |
| conversation.item.input_audio_transcription.failed | Not emitted. |
| conversation.item.input_audio_transcription.segment | Not supported. |
| conversation.item.retrieved | Not supported. |
| conversation.item.truncated | Not supported. |
| input_audio_buffer.dtmf_event_received | SIP only. |
| input_audio_buffer.timeout_triggered | Not emitted. |
| output_audio_buffer.started | WebRTC/SIP only. |
| output_audio_buffer.stopped | WebRTC/SIP only. |
| output_audio_buffer.cleared | WebRTC/SIP only. |
| rate_limits.updated | Not emitted. |
