Model Capabilities
Image Understanding
When sending images, we advise against storing the request/response history on the server; otherwise the request may fail. See Disable storing previous request/response on server.
Some models allow images in the input. The model will consider the image context when generating the response.
Constructing the message body - difference from text-only prompt
A request for image understanding is similar to a text-only prompt. The main difference is that instead of passing content as a text string:
JSON
[
  {
    "role": "user",
    "content": "What is in this image?"
  }
]
We send content as a list of objects:
JSON
[
  {
    "role": "user",
    "content": [
      {
        "type": "input_image",
        "image_url": "data:image/jpeg;base64,<base64_image_string>",
        "detail": "high"
      },
      {
        "type": "input_text",
        "text": "What is in this image?"
      }
    ]
  }
]
The image_url value can also be the image's URL on the Internet.
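As a rough illustration of the message body described above, the sketch below encodes a local image as a base64 data URL and wraps it in the list-of-objects content format. The helper names to_data_url and build_image_message are hypothetical and are not part of any SDK; only the field names (type, image_url, detail, text) come from the JSON shown above.

```python
import base64
import json
from pathlib import Path

# Suffix-to-MIME mapping for the image types this page lists as supported.
MIME_BY_SUFFIX = {".jpg": "image/jpeg", ".jpeg": "image/jpeg", ".png": "image/png"}


def to_data_url(path: str) -> str:
    """Encode a local jpg/jpeg/png file as a base64 data URL (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix not in MIME_BY_SUFFIX:
        raise ValueError(f"unsupported image type: {suffix!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{MIME_BY_SUFFIX[suffix]};base64,{encoded}"


def build_image_message(image_url: str, prompt: str, detail: str = "high") -> list:
    """Build a user message whose content is a list of image and text objects."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "input_image", "image_url": image_url, "detail": detail},
                {"type": "input_text", "text": prompt},
            ],
        }
    ]


# image_url may be a data URL from to_data_url(...) or a plain web URL.
message = build_image_message("https://example.com/cat.png", "What is in this image?")
print(json.dumps(message, indent=2))
```

The same build_image_message call accepts the output of to_data_url("photo.jpg") when the image lives on disk rather than on the web.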
Image understanding example
import os

from xai_sdk import Client
from xai_sdk.chat import user, image

client = Client(
    api_key=os.getenv("XAI_API_KEY"),
    management_api_key=os.getenv("XAI_MANAGEMENT_API_KEY"),
    timeout=3600,
)

image_url = "https://science.nasa.gov/wp-content/uploads/2023/09/web-first-images-release.png"

chat = client.chat.create(model="grok-4-1-fast-reasoning")
chat.append(
    user(
        "What's in this image?",
        image(image_url=image_url, detail="high"),
    )
)

response = chat.sample()
print(response)

# The response ID that can be used to continue the conversation later
print(response.id)
Image input general limits
- Maximum image size: 20MiB
- Maximum number of images: No limit
- Supported image file types: jpg/jpeg or png.
- Any image/text input order is accepted (e.g. text prompt can precede image prompt)
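The limits above can be checked on the client before a request is sent. The following is a minimal sketch, assuming the size limit applies to the raw file size and that file type can be judged from the file extension; check_image and the constant names are hypothetical and are not part of the API.

```python
from pathlib import Path

MAX_IMAGE_BYTES = 20 * 1024 * 1024  # 20 MiB, per the limits listed above
SUPPORTED_SUFFIXES = {".jpg", ".jpeg", ".png"}


def check_image(name: str, size_bytes: int) -> list[str]:
    """Return a list of problems found; an empty list means the image passes."""
    problems = []
    suffix = Path(name).suffix.lower()
    if suffix not in SUPPORTED_SUFFIXES:
        problems.append(f"unsupported file type: {suffix or '(none)'}")
    if size_bytes > MAX_IMAGE_BYTES:
        problems.append(f"image is {size_bytes} bytes, over the 20 MiB limit")
    return problems


# Example: a small PNG passes, an oversized GIF fails both checks.
print(check_image("chart.png", 512_000))
print(check_image("animation.gif", 25 * 1024 * 1024))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue with an image in a single pass.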
Image detail levels
The "detail" field controls the level of pre-processing applied to the image that will be provided to the model. It is optional and determines the resolution at which the image is processed. The possible values for "detail" are:
- "auto": The system will automatically determine the image resolution to use. This is the default setting, balancing speed and detail based on the model's assessment.
- "low": The system will process a low-resolution version of the image. This option is faster and consumes fewer tokens, making it more cost-effective, though it may miss finer details.
- "high": The system will process a high-resolution version of the image. This option is slower and more expensive in terms of token usage, but it allows the model to attend to more nuanced details in the image.