Codex CLI: Multimodal Custom Tool Outputs

Codex CLI 0.107.0 expands the custom tool API to support multimodal return values, enabling tools to return structured content including images alongside text. Previously, custom tools were limited to plain text responses, which constrained use cases involving screenshots, diagrams, and rendered output. This update unlocks a wider class of custom integrations and brings tool output closer to the multimodal content the underlying models can natively consume.


Multimodal Outputs for Custom Tools

Starting with Codex CLI 0.107.0, custom tools are no longer restricted to plain text return values. Tools can now return structured multimodal content — including images — giving developers far more flexibility when building integrations that involve visual output.

This change is significant for any workflow where agents interact with tools that produce inherently visual results: browser automation returning screenshots, rendering engines outputting diagrams, test runners generating visual diffs, or code preview tools producing UI snapshots. Previously, these use cases required workarounds such as saving images to disk and referencing them by path. With multimodal tool output support, the image data can be returned directly as part of the tool response, allowing the model to reason about the visual content without additional steps.
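In concrete terms, the shift looks something like the sketch below. The field names (`type`, `text`, `data`, `mime_type`) are illustrative assumptions modeled on common structured-content schemas, not Codex's documented format.

```python
import base64

def legacy_result(image_path: str) -> str:
    # Pre-0.107.0 workaround: write the image to disk and return only a
    # path, leaving the agent to load it in a separate step.
    return f"Screenshot saved to {image_path}"

def multimodal_result(image_bytes: bytes, caption: str) -> list[dict]:
    # With multimodal output support, text and image data can travel
    # together in one structured payload. Field names are hypothetical.
    return [
        {"type": "text", "text": caption},
        {
            "type": "image",
            "data": base64.b64encode(image_bytes).decode("ascii"),
            "mime_type": "image/png",
        },
    ]
```

The point of the contrast is that the second form hands the model the image itself, not a reference it must resolve through another tool call.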

What This Enables

This addition enables several concrete developer workflows:

Visual debugging. A tool that captures a rendered browser state or UI component can now return the screenshot inline. The agent receives the image as part of the tool result and can reason about what it sees — identifying layout issues, comparing before/after states, or validating that a UI change produced the expected result.
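A minimal before/after check along these lines might look as follows. The payload fields are illustrative assumptions, and the byte-level comparison is a dependency-free stand-in for real visual diffing, which would compare decoded pixels.

```python
import base64
import hashlib

def visual_diff_result(before_png: bytes, after_png: bytes) -> list[dict]:
    # Compare two captured screenshots and return a verdict plus both
    # images inline, so the agent can inspect the change itself.
    # (A real tool would diff decoded pixels; hashing bytes is a stand-in.)
    changed = hashlib.sha256(before_png).digest() != hashlib.sha256(after_png).digest()
    verdict = "UI changed between captures" if changed else "No visual change detected"

    def image_block(data: bytes, label: str) -> dict:
        return {
            "type": "image",
            "data": base64.b64encode(data).decode("ascii"),
            "mime_type": "image/png",
            "label": label,  # hypothetical field for the agent's benefit
        }

    return [
        {"type": "text", "text": verdict},
        image_block(before_png, "before"),
        image_block(after_png, "after"),
    ]
```

Returning both images rather than just the verdict lets the model double-check the tool's coarse comparison against what it actually sees.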

Diagram generation. Tools that produce Mermaid, Graphviz, or other diagram formats rendered to images can now return those images directly, allowing agents to incorporate visual representations of architecture or data flow into their reasoning.
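One way to structure such a tool is to keep the renderer pluggable, as in the sketch below. Here `render_png` stands in for an actual Graphviz or Mermaid invocation, and the payload fields are assumed rather than taken from the documented schema.

```python
import base64
from typing import Callable

def diagram_result(source: str, render_png: Callable[[str], bytes]) -> list[dict]:
    # render_png is any callable that turns diagram source (e.g. Graphviz
    # DOT or Mermaid text) into PNG bytes -- for instance a subprocess
    # wrapper around a local renderer. Injecting it keeps this sketch
    # dependency-free and easy to test.
    png = render_png(source)
    return [
        {"type": "text", "text": f"Rendered diagram ({len(png)} bytes)"},
        {
            "type": "image",
            "data": base64.b64encode(png).decode("ascii"),
            "mime_type": "image/png",
        },
    ]
```

The short text block alongside the image gives the model a size/status summary even before it examines the rendered output.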

Structured mixed content. Tool responses can now combine text and images in a single structured payload, matching the richer content model already available in the broader OpenAI API surface.

Relationship to Broader Multimodal Direction

This addition aligns Codex's custom tool API more closely with the multimodal capabilities of the underlying models. As GPT-5.3-Codex-Spark and related models grow their visual reasoning capabilities, having tools that can feed image data directly into the model's context becomes increasingly valuable. Codex CLI 0.107.0 lays the groundwork for a more complete multimodal agent loop.