Inference API
Chat
Chat completions
/v1/chat/completions
Create a chat response from text or image chat prompts. This is the endpoint for making requests to chat and image-understanding models.
Request Body
Response Body
choices
array
A list of response choices from the model. The length corresponds to the n parameter in the request body (defaults to 1).
created
integer
The chat completion creation time as a Unix timestamp.
id
string
A unique ID for the chat response.
model
string
Model ID used to create the chat completion.
object
string
The object type, which is always "chat.completion".
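The request above can be sketched as a plain JSON POST. This is a minimal illustration, not an official client: the base URL, API key, and model ID below are placeholders, and only the path, the n parameter, and the response fields come from this reference.

```python
import json
import urllib.request

# Placeholder values -- substitute your real base URL, key, and model ID.
API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.example.com"

payload = {
    "model": "my-chat-model",  # hypothetical model ID
    "messages": [
        {"role": "user", "content": "Describe this image in one sentence."},
    ],
    "n": 1,  # number of choices; mirrored by the length of choices in the response
}

req = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    },
    method="POST",
)

# Sending the request (requires a valid key and base URL):
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)
#     # body["object"] is "chat.completion"; body["choices"] has n entries
```

The response body then carries the id, model, created timestamp, and choices fields described above.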
Create new response
/v1/responses
Generates a response based on text or image prompts. The response ID can be used to retrieve the response later or to continue the conversation without repeating prior context. New responses will be stored for 30 days and then permanently deleted.
Request Body
input
string | array
required
Content of the input passed to a /v1/responses request.
Response Body
background
boolean
default: false
Whether to process the response asynchronously in the background. Included only for OpenResponses compatibility and not currently used.
created_at
integer
The Unix timestamp (in seconds) for the response creation time.
error
An error object returned when the model fails to generate a response.
frequency_penalty
number
(NOT SUPPORTED in Responses API) Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
id
string
Unique ID of the response.
metadata
Only included for compatibility.
model
string
Model name used to generate the response.
object
string
The object type of this resource. Always set to response.
output
array
The response generated by the model.
parallel_tool_calls
boolean
Whether to allow the model to run parallel tool calls.
presence_penalty
number
(NOT SUPPORTED in Responses API) Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
service_tier
string
default: default
Specifies the processing tier used for serving the request.
status
string
Status of the response. One of completed, in_progress or incomplete.
store
boolean
default: true
Whether to store the input message(s) and model response for later retrieval.
text
object
tool_choice
string | object
Controls how the model chooses tools.
tools
array
A list of tools the model may call, described in JSON Schema. Currently, only functions and web search are supported as tools. A maximum of 128 tools is supported.
top_logprobs
integer
An integer between 0 and 8 specifying the number of most likely tokens to return at each token position.
truncation
string
default: disabled
The truncation strategy to use for the model response.
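A create-response request body can be sketched as follows. The field names (input, store) come from this reference; the model ID is hypothetical, and the exact content shape for array input is an assumption.

```python
import json

# Minimal request body with a string input. `store` defaults to true,
# which keeps the response retrievable for 30 days.
create_request = {
    "model": "my-model",  # hypothetical model ID
    "input": "Summarize the plot of Hamlet in two sentences.",
    "store": True,
}

# `input` may also be an array; the message shape here is an assumed
# example, mixing roles and content parts.
create_request_multimodal = {
    "model": "my-model",
    "input": [
        {"role": "user", "content": "What is shown in this image?"},
    ],
}

body = json.dumps(create_request)
```

POSTing this body to /v1/responses returns a response object whose id can be reused to continue the conversation without resending prior context.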
Retrieve previous response
/v1/responses/{response_id}
Retrieve a previously generated response.
Path parameters
response_id
string
required
The response id returned by a previous create response request.
Response Body
background
boolean
default: false
Whether to process the response asynchronously in the background. Included only for OpenResponses compatibility and not currently used.
created_at
integer
The Unix timestamp (in seconds) for the response creation time.
error
An error object returned when the model fails to generate a response.
frequency_penalty
number
(NOT SUPPORTED in Responses API) Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
id
string
Unique ID of the response.
metadata
Only included for compatibility.
model
string
Model name used to generate the response.
object
string
The object type of this resource. Always set to response.
output
array
The response generated by the model.
parallel_tool_calls
boolean
Whether to allow the model to run parallel tool calls.
presence_penalty
number
(NOT SUPPORTED in Responses API) Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
service_tier
string
default: default
Specifies the processing tier used for serving the request.
status
string
Status of the response. One of completed, in_progress or incomplete.
store
boolean
default: true
Whether to store the input message(s) and model response for later retrieval.
text
object
tool_choice
string | object
Controls how the model chooses tools.
tools
array
A list of tools the model may call, described in JSON Schema. Currently, only functions and web search are supported as tools. A maximum of 128 tools is supported.
top_logprobs
integer
An integer between 0 and 8 specifying the number of most likely tokens to return at each token position.
truncation
string
default: disabled
The truncation strategy to use for the model response.
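Retrieval is a simple GET on the path above. In this sketch the base URL, key, and response ID are placeholders; only the path shape and the status values come from this reference.

```python
import urllib.request

# Placeholder values for illustration only.
BASE_URL = "https://api.example.com"
response_id = "resp_abc123"  # ID returned by a previous create-response call

req = urllib.request.Request(
    f"{BASE_URL}/v1/responses/{response_id}",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    method="GET",
)

# Sending it (requires a valid key and base URL, plus `import json`):
# with urllib.request.urlopen(req) as resp:
#     data = json.load(resp)
#     # data["id"] matches response_id; data["status"] is one of
#     # "completed", "in_progress", or "incomplete"
```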
Delete previous response
/v1/responses/{response_id}
Delete a previously generated response.
Path parameters
response_id
string
required
The response id returned by a previous create response request.
Response Body
deleted
boolean
Whether the response was successfully deleted.
id
string
The ID of the deleted response.
object
string
The deleted object type, which is always response.
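Deletion reuses the same path with, presumably, the HTTP DELETE method (this reference names the operation but not the verb, so the method here is an assumption; the base URL and key are placeholders).

```python
import urllib.request

# Placeholder values for illustration only.
BASE_URL = "https://api.example.com"
response_id = "resp_abc123"  # ID from a previous create-response call

req = urllib.request.Request(
    f"{BASE_URL}/v1/responses/{response_id}",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    method="DELETE",  # assumed verb for the delete operation
)

# Sending it (requires a valid key and base URL, plus `import json`):
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
#     # result["deleted"] is true on success; result["id"] echoes response_id
```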
Get deferred chat completions
/v1/chat/deferred-completion/{request_id}
Tries to fetch the result of a previously started deferred completion. Returns 200 OK with the response body if the request has completed, or 202 Accepted while the request is still pending.
Path parameters
request_id
string
required
The deferred request id returned by a previous deferred chat request.
Response Body
choices
array
A list of response choices from the model. The length corresponds to the n parameter in the request body (defaults to 1).
created
integer
The chat completion creation time as a Unix timestamp.
id
string
A unique ID for the chat response.
model
string
Model ID used to create chat completion.
object
string
The object type, which is always "chat.completion".
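The 200-versus-202 contract above lends itself to a polling loop. This is a sketch under assumptions: the base URL and polling interval are placeholders, while the path and the two status codes come from this reference.

```python
import json
import time
import urllib.request

# Placeholder base URL for illustration only.
BASE_URL = "https://api.example.com"

def poll_deferred(request_id: str, api_key: str, interval: float = 2.0) -> dict:
    """Poll a deferred completion until it finishes.

    A 200 status means the chat completion body is ready and is returned;
    a 202 means the request is still pending, so we sleep and retry.
    """
    url = f"{BASE_URL}/v1/chat/deferred-completion/{request_id}"
    headers = {"Authorization": f"Bearer {api_key}"}
    while True:
        req = urllib.request.Request(url, headers=headers, method="GET")
        with urllib.request.urlopen(req) as resp:
            if resp.status == 200:  # completed: body is a chat completion
                return json.load(resp)
        # urlopen does not raise on 202, so falling through here means
        # the request is still pending processing.
        time.sleep(interval)
```

A fixed interval is the simplest choice; exponential backoff would reduce load for long-running requests.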