The recent introduction of OpenAI’s Responses API marks an evolution in how developers interact with large language models. This new API primitive sits alongside the established Chat Completions API but offers a more sophisticated foundation for building action-oriented applications. This article examines the technical architecture, capabilities, and implementation details of the Responses API for developers looking to leverage its enhanced functionality.
API Architecture and Design Philosophy
The Responses API represents OpenAI’s shift toward more agentic API primitives. While the Chat Completions API follows a straightforward request-response pattern, the Responses API employs an event-driven architecture that better accommodates tool execution, multi-turn reasoning, and stateful interactions.
At its core, the Responses API uses a REST-based design for the initial request, but the streaming capabilities deliver a structured event stream that offers detailed semantic information about model outputs as they occur, rather than just appending tokens to content fields.
Request Schema and Parameters
The base endpoint for the Responses API is:
POST https://api.openai.com/v1/responses
Key request parameters include:
{
  "model": "gpt-4o",
  "input": "Tell me a three sentence bedtime story about a unicorn.",
  "instructions": "Please assist the user.",
  "tools": [],
  "tool_choice": "auto",
  "temperature": 0.8,
  "max_output_tokens": 1024,
  "stream": true
}
Notable parameters include:
- model: Specifies which model to use (e.g., “gpt-4o” or “o1”)
- input: The user’s message (similar to the messages array in Chat Completions)
- instructions: System-style guidance for the model
- tools: Array of tools the model can call
- parallel_tool_calls: Boolean enabling multiple simultaneous tool calls
- previous_response_id: For multi-turn conversations, a reference to the prior response
- include: Specifies additional output data to include (e.g., file search results)
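Putting these pieces together, the request can be issued with nothing beyond the Python standard library. The helper names below (build_request, create_response) and the OPENAI_API_KEY environment variable are illustrative conventions, not part of the API itself:

```python
import json
import os
import urllib.request

API_URL = "https://api.openai.com/v1/responses"

def build_request(model: str, input_text: str, **options) -> dict:
    """Assemble a Responses API request body from keyword options."""
    payload = {"model": model, "input": input_text}
    payload.update(options)
    return payload

def create_response(payload: dict) -> dict:
    """POST the payload to the Responses endpoint and decode the JSON reply."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling create_response(build_request("gpt-4o", "Hello", max_output_tokens=1024)) would then return a response object of the shape shown in the next section.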
Response Object Structure
The Responses API returns a more comprehensive and structured object:
{
  "id": "resp_67ccd2bed1ec8190b14f964abc0542670bb6a6b452d3795b",
  "object": "response",
  "created_at": 1741476542,
  "status": "completed",
  "output": [
    {
      "type": "message",
      "id": "msg_67ccd2bf17f0819081ff3bb2cf6508e60bb6a6b452d3795b",
      "status": "completed",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "In a peaceful grove beneath a silver moon, a unicorn named Lumina discovered a hidden pool that reflected the stars. As she dipped her horn into the water, the pool began to shimmer, revealing a pathway to a magical realm of endless night skies. Filled with wonder, Lumina whispered a wish for all who dream to find their own hidden magic, and as she glanced back, her hoofprints sparkled like stardust.",
          "annotations": []
        }
      ]
    }
  ],
  "usage": {
    "input_tokens": 36,
    "output_tokens": 87,
    "total_tokens": 123
  }
}
The response structure differs from Chat Completions in several important ways:
- status field indicates processing state (completed, failed, in_progress, incomplete)
- output is an array that can contain multiple content items
- Usage statistics are directly embedded in the response
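Because output is an array of typed items rather than a single message field, client code typically walks it to recover the generated text. A minimal sketch (the helper name is an illustrative choice):

```python
def collect_output_text(response: dict) -> str:
    """Concatenate every output_text part across all message items in `output`."""
    chunks = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue  # skip non-message items such as tool calls
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                chunks.append(part["text"])
    return "".join(chunks)
```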
Event-Driven Streaming Architecture
Unlike Chat Completions’ undifferentiated stream of tokens, the Responses API emits structured semantic events. When streaming is enabled, the API emits events such as:
event: response.created
data: {"type": "response.created", "response": {...}}
event: response.in_progress
data: {"type": "response.in_progress", "response": {...}}
event: response.output_item.added
data: {"type": "response.output_item.added", "output_index": 0, "item": {...}}
event: response.content_part.added
data: {"type": "response.content_part.added", "item_id": "msg_123", "output_index": 0, "content_index": 0, "part": {...}}
event: response.output_text.delta
data: {"type": "response.output_text.delta", "item_id": "msg_123", "output_index": 0, "content_index": 0, "delta": "In"}
event: response.output_text.done
data: {"type": "response.output_text.done", "item_id": "msg_123", "output_index": 0, "content_index": 0, "text": "In a peaceful grove..."}
event: response.completed
data: {"type": "response.completed", "response": {...}}
This architecture allows developers to implement more precise UI updates and state management, parsing specific event types rather than calculating differences between content chunks.
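A client can fold these events into UI state with a simple dispatcher keyed on event type. The sketch below assumes the events have already been parsed from the SSE stream into dicts like those shown above:

```python
def consume_stream(events):
    """Fold a sequence of parsed Responses stream events into final text.

    `events` is an iterable of dicts matching the `data:` payloads above;
    a real client would first parse them out of the HTTP event stream.
    """
    text_parts = []
    status = None
    for event in events:
        etype = event["type"]
        if etype == "response.output_text.delta":
            text_parts.append(event["delta"])  # incremental UI update goes here
        elif etype == "response.completed":
            status = event["response"].get("status", "completed")
    return "".join(text_parts), status
```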
Tool Integration
One of the Responses API’s most significant advancements is its first-class support for tools. The API currently offers three built-in tools:
1. Web Search Tool
{
  "tools": [{
    "type": "web_search"
  }]
}
The web search tool performs real-time internet searches, with the model deciding when to search and what query to issue, and incorporates the results into its responses, enabling up-to-date information retrieval.
2. File Search Tool
{
  "tools": [{
    "type": "file_search",
    "vector_store_ids": ["vs_abc123"]
  }]
}
The file search tool enables semantic search across user-uploaded documents stored in vector stores, allowing models to reference and cite specific information from provided files.
3. Computer Use Tool
{
  "tools": [{
    "type": "computer_use_preview",
    "display_width": 1024,
    "display_height": 768,
    "environment": "browser"
  }]
}
The computer use tool lets the model operate a computer interface: it receives screenshots of the environment and emits actions such as clicks, keystrokes, and scrolls, which the calling application executes and reports back. Unlike the Code Interpreter in the Assistants API, which runs code in a sandbox, this tool is aimed at driving graphical interfaces for browser and desktop automation.
Custom Function Calling
Beyond built-in tools, the Responses API supports custom function definitions similar to the Chat Completions API, though the definition is flattened rather than nested under a function key:
{
  "tools": [{
    "type": "function",
    "name": "get_weather",
    "description": "Get current weather for a location",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {
          "type": "string",
          "description": "The city and state"
        }
      },
      "required": ["location"]
    }
  }]
}
The tool_choice parameter controls whether and which tool the model uses, with options including “auto”, “none”, “required”, or specifying a particular function.
Stateful Conversation Management
The Responses API simplifies multi-turn conversations through its stateful design. Instead of maintaining and sending the full conversation history with each request, developers can reference previous interactions using the previous_response_id parameter:
{
  "model": "gpt-4o",
  "input": "What about tomorrow?",
  "previous_response_id": "resp_abc123"
}
This approach reduces payload sizes and eliminates the need for client-side conversation tracking, as OpenAI handles the state management internally.
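A thin session wrapper is enough to manage this chaining on the client side. The class below is an illustrative sketch, not an SDK type:

```python
class ConversationSession:
    """Track the last response id so each turn chains to the previous one."""

    def __init__(self, model: str):
        self.model = model
        self.previous_response_id = None

    def next_payload(self, user_input: str) -> dict:
        """Build the request body for the next turn."""
        payload = {"model": self.model, "input": user_input}
        if self.previous_response_id is not None:
            payload["previous_response_id"] = self.previous_response_id
        return payload

    def record(self, response: dict) -> None:
        """Call after each API reply so the next turn references it."""
        self.previous_response_id = response["id"]
```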
Technical Implementation Details
Token Management
The Responses API provides more granular control over token usage through parameters like:
- max_output_tokens: Caps the number of tokens generated in the response
- truncation: Controls whether conversation history gets truncated automatically when approaching context limits
Response Format Control
Like Chat Completions, the Responses API supports structured outputs:
{
  "text": {
    "format": {
      "type": "json_schema",
      "name": "person",
      "schema": {
        "type": "object",
        "properties": {
          "name": {"type": "string"},
          "age": {"type": "integer"}
        },
        "required": ["name", "age"],
        "additionalProperties": false
      },
      "strict": true
    }
  }
}
This ensures responses conform to specific JSON schemas or always produce valid JSON objects.
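On the client side, the schema-constrained output still arrives as text, so it needs to be decoded and, ideally, sanity-checked. A small sketch assuming a name/age person schema like the one above (the helper name is illustrative):

```python
import json

def parse_person(response_text: str) -> dict:
    """Decode the model's JSON output and sanity-check the expected keys."""
    data = json.loads(response_text)
    missing = {"name", "age"} - set(data)
    if missing:
        raise ValueError(f"schema fields missing from output: {missing}")
    return data
```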
Error Handling
The Responses API uses a consistent error format with improved detail:
{
  "type": "error",
  "code": "ERR_SOMETHING",
  "message": "Something went wrong",
  "param": null
}
In streaming mode, errors are emitted as events within the stream rather than breaking the connection, allowing more graceful client-side recovery.
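A client can surface these in-stream errors by checking each event before processing it. A minimal sketch, where StreamError is an illustrative exception type rather than an SDK class:

```python
class StreamError(Exception):
    """Raised when the stream delivers an error event instead of content."""

def check_event(event: dict) -> dict:
    """Raise on in-stream error events; pass other events through unchanged."""
    if event.get("type") == "error":
        raise StreamError(f"{event.get('code')}: {event.get('message')}")
    return event
```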
Performance Considerations
The Responses API introduces certain performance tradeoffs compared to Chat Completions:
- Latency: Initial response time may be slightly higher due to the more complex architecture, particularly when using tools
- Throughput: Built-in tool calls can reduce overall throughput but eliminate the need for separate API calls to manage tools
- Token efficiency: The stateful design can reduce token usage in multi-turn conversations by eliminating the need to resend conversation history
- Streaming efficiency: The semantic event structure adds slight overhead to streaming but significantly reduces client-side parsing complexity
Migrating from Chat Completions
Developers migrating from Chat Completions to the Responses API should consider these key architectural differences:
- Request structure: Responses uses a flatter request structure rather than nested message arrays
- Stream processing: Event-driven rather than content-appending approach
- Tool implementation: First-class support for tools versus function calling patterns
- State management: Built-in conversation state versus client-managed history
Here’s a simple comparison of a basic request in both APIs:
Chat Completions:
{
  "model": "gpt-4o",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello world"}
  ]
}
Responses API:
{
  "model": "gpt-4o",
  "input": "Hello world",
  "instructions": "You are a helpful assistant."
}
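For simple system-plus-user requests, the translation between the two shapes is mechanical, as this sketch illustrates; multi-turn histories would instead map onto an input array of message items:

```python
def chat_to_responses(chat_request: dict) -> dict:
    """Translate a single-system, single-user Chat Completions request
    into the flatter Responses API form."""
    payload = {"model": chat_request["model"]}
    for message in chat_request["messages"]:
        if message["role"] == "system":
            payload["instructions"] = message["content"]
        elif message["role"] == "user":
            payload["input"] = message["content"]
    return payload
```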
Conclusion
The Responses API represents a significant technical evolution in OpenAI’s API offerings, particularly for applications requiring agentic capabilities. While the Chat Completions API remains the industry standard for straightforward text generation, the Responses API provides a more sophisticated foundation for applications that need complex tool integration, stateful interactions, and precise control over model outputs.
As developers build increasingly sophisticated AI applications, the Responses API’s event-driven architecture, first-class tool support, and simplified state management offer compelling advantages that streamline development workflows and enable more complex use cases. OpenAI will continue maintaining both APIs, ensuring backward compatibility while pushing forward with new models and capabilities for both interfaces.