Prompt Caching
How It Works
The cache works from the start of your messages array. When a request arrives, the system checks how many messages at the beginning match a previous request exactly — that matching portion is the "prefix" and gets served from cache:
- First request — The full prompt is processed and cached
- Subsequent requests — If the prompt prefix matches, the cached portion is reused (a cache hit)
- Billing — Cached tokens are billed at a reduced rate
Prompt caching is not 100% guaranteed. Cache entries can be evicted due to memory pressure, and requests may be routed to different servers. Use x-grok-conv-id to maximize cache hit rates.
Example
Request 1:
Text
[system] "You are a helpful assistant."
[user] "What is the capital of France?"
[assistant] "The capital of France is Paris."
Request 2:
Text
[system] "You are a helpful assistant." ← cached
[user] "What is the capital of France?" ← cached
[assistant] "The capital of France is Paris." ← cached
[user] "What about Germany?" ← new
The first 3 messages match Request 1 exactly, so they're served from cache. Only the new message is computed.
Next
Did you find this page helpful?