Usage, Tiers and Rate Limits
During beta, all users have access to the free tier rate limits. You may be able to request a quota increase once you become a paying customer with pre-purchased credits.
To check your tier and rate limits, visit the xAI Console Models Page.
Tokens
A token is the base unit of prompt size for model inference and pricing purposes. It consists of one or more characters or symbols.
When a Grok model handles your request, the input prompt is first decomposed into a list of tokens by a tokenizer. The model then runs inference on the prompt tokens and generates completion tokens. After inference completes, the completion tokens are aggregated into a completion response and sent back to you.
For a given text/image/etc. prompt or completion sequence, different tokenizers may break it down into token lists of different lengths. Different Grok models may also share or use different tokenizers. Therefore, the same prompt/completion sequence may not have the same number of tokens across different models.
The token count of a prompt/completion sequence should grow approximately linearly with the sequence length.
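Grok's tokenizer is not shown here, but you can get a feel for how tokenization works with an open-source tokenizer library. A minimal sketch using tiktoken (an OpenAI tokenizer library, used purely for illustration; the exact counts for Grok models will differ):

```python
# Illustrative only: tiktoken ships OpenAI tokenizers, not Grok's,
# so the exact token counts for Grok models will differ.
import tiktoken

text = "Grok models break prompts into tokens before inference."

# Two different tokenizers produce token lists of different lengths
# for the same input sequence.
for encoding_name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    print(f"{encoding_name}: {len(tokens)} tokens -> {tokens[:8]}...")
```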
Hitting rate limits
For each tier, there is a maximum number of requests per minute. This ensures fair usage of the system by all users.
Once your request frequency reaches the rate limit, you will receive a 429 error code in response.
You can either:
- Upgrade your team to a higher tier
- Change your consumption pattern to send fewer requests (for example, by retrying with exponential backoff, as sketched below)
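A common way to smooth out bursty traffic is to retry with exponential backoff whenever a 429 is returned. A minimal sketch using the requests library (the endpoint URL, model name, and environment variable are assumptions for illustration):

```python
import os
import time

import requests

API_URL = "https://api.x.ai/v1/chat/completions"  # assumed endpoint

def post_with_backoff(payload: dict, max_retries: int = 5) -> dict:
    """POST to the API, backing off and retrying on 429 responses."""
    headers = {"Authorization": f"Bearer {os.environ['XAI_API_KEY']}"}
    for attempt in range(max_retries):
        resp = requests.post(API_URL, json=payload, headers=headers, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Rate limited: wait 1s, 2s, 4s, ... before trying again.
        time.sleep(2 ** attempt)
    raise RuntimeError("Still rate limited after all retries")

result = post_with_backoff({
    "model": "grok-beta",  # assumed model name
    "messages": [{"role": "user", "content": "Hello!"}],
})
```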
Checking token consumption
Each completion response contains a usage object detailing your prompt and completion token counts. You may find it helpful to keep track of these counts to avoid hitting rate limits or incurring unexpected costs.
You can also check token usage with the OpenAI or Anthropic SDKs.
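For example, a minimal sketch with the OpenAI SDK pointed at the xAI API (the base URL, model name, and environment variable below are assumptions for illustration):

```python
import os

from openai import OpenAI

# Point the OpenAI SDK at the xAI API.
client = OpenAI(
    api_key=os.environ["XAI_API_KEY"],
    base_url="https://api.x.ai/v1",
)

response = client.chat.completions.create(
    model="grok-beta",
    messages=[{"role": "user", "content": "Hello!"}],
)

# The usage object reports token counts for this request.
print(response.usage.prompt_tokens)      # tokens in your prompt
print(response.usage.completion_tokens)  # tokens in the generated reply
print(response.usage.total_tokens)       # sum of the two
```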