Image Understanding
The vision model can receive both text and image inputs. You can pass images into the model in one of two ways: base64 encoded strings or web URLs.
Under the hood, image understanding shares the same API route and the same message body schema consisted of system
/user
/assistant
messages. The difference is having image in the message content body instead of text.
As the knowledge in this guide is built upon understanding of the chat capability. It is suggested that you familiarize yourself with the chat capability before following this guide.
Prerequisites
- xAI Account: You need an xAI account to access the API.
- API Key: Ensure that your API key has access to the vision endpoint and a model supporting image input is enabled.
If you don't have these and are unsure of how to create one, follow the Hitchhiker's Guide to Grok.
Set your API key in your environment:
Reminder on image understanding model general limitations
It might be easier to run into model limit with these models than chat models:
- Maximum image size:
10MiB
- Maximum number of images: No limit
- Any image/text input order is accepted (e.g. text prompt can precede image prompt)
Constructing the Message Body - Difference from Chat
The request message to image understanding is similar to chat. The main difference is that instead of text input:
We send in content
as a list of objects:
The image_url.url
can also be the image's url on the Internet.
You can use the text prompt to ask questions about the image(s), or discuss topics with the image as context to the discussion, etc.
Web URL Input
The model supports web URL as inputs for images. The API will fetch the image from the public URL and handle it as part of the chat. Integrating with URLs is as simple as:
Base64 string input
You will need to pass in base64 encoded image directly in the request, in the user messages.
Here is an example of how you can load a local image, encode it in Base64 and use it as part of your conversation:
Multiple images input
You can send multiple images in the prompt, for example:
The image prompts can interleave with text prompts in any order.
Image token usage
The prompt image token usage is provided in the API response.