OpenAI has launched o3 and o4-mini, its most advanced AI models to date. Featuring full tool access, native multimodal reasoning, and improved instruction-following, these models significantly elevate autonomous problem-solving across real-world, academic, and visual tasks—but also introduce new concerns around increased hallucination rates and reliability in high-stakes environments.

OpenAI Launches New Reasoning Models o3 and o4-mini – Key Points
New Model Launch & Availability
On April 16, 2025, OpenAI released o3 and o4-mini—two reasoning-focused models with full access to ChatGPT’s toolset, including browsing, Python code execution, file analysis, and image generation. These models are available to ChatGPT Plus, Pro, and Team users; Enterprise and Edu accounts gain access one week later. Free-tier users can test o4-mini via the “Think” option. The release replaces older models o1, o3-mini, and o3-mini-high across all tiers.
Simulated Reasoning with Tool Autonomy
Trained with reinforcement learning, the models dynamically select and combine tools to execute complex, multi-step tasks. They can autonomously retrieve public data, write and run Python code, analyze images or documents, generate visual outputs, and pivot strategy as new information arrives, typically within a minute. In one example, o3 answered a query about California’s summer energy usage by chaining web search, code execution, graph generation, and a narrative explanation into a single response. Some tasks involve chaining hundreds of successive tool calls, showcasing emerging agentic behavior.
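To make the tool loop concrete, here is a minimal sketch of the same select-execute-continue pattern using the public Chat Completions function-calling interface. The get_energy_data tool, its schema, and its placeholder data are hypothetical stand-ins for illustration, not OpenAI’s internal tooling.

```python
# Minimal function-calling loop: the model decides whether to call a tool,
# the application executes it, and the result is fed back for a final answer.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical tool definition for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_energy_data",
        "description": "Fetch summer electricity usage for a US state.",
        "parameters": {
            "type": "object",
            "properties": {"state": {"type": "string"}},
            "required": ["state"],
        },
    },
}]

def get_energy_data(state: str) -> str:
    # Placeholder data; a real tool would query an external source.
    return json.dumps({"state": state, "peak_demand_gw": 52.1})

messages = [{"role": "user",
             "content": "How is California's summer energy usage trending?"}]
response = client.chat.completions.create(
    model="o4-mini", messages=messages, tools=tools)
msg = response.choices[0].message

if msg.tool_calls:  # the model chose to call the tool
    messages.append(msg)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": get_energy_data(**args),
        })
    final = client.chat.completions.create(
        model="o4-mini", messages=messages, tools=tools)
    print(final.choices[0].message.content)
```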
Multimodal Capabilities—“Thinking with Images”
o3 and o4-mini natively incorporate images into their reasoning process—an OpenAI first. They can interpret blurry, rotated, upside-down, or hand-drawn visuals like whiteboards, signs, charts, and diagrams. Operations such as cropping, zooming, and rotation are applied mid-reasoning, enhancing accuracy. On the V* visual benchmark, o3 achieved 95.7% accuracy. These models unify vision and language pipelines for richer output, eliminating reliance on external vision systems.
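Developers reach the same capability through ordinary multimodal inputs. A minimal sketch, assuming the standard Chat Completions image-input format; the image URL is a placeholder:

```python
# Send an image alongside a question so the model can reason over it.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o4-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What process does this whiteboard diagram describe?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/whiteboard.jpg"}},  # placeholder URL
        ],
    }],
)
print(response.choices[0].message.content)
```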
Instruction Following and Personalization
The models deliver natural, verifiable, and context-aware outputs. They can remember prior conversation context, follow nuanced user instructions, and tailor responses to ongoing collaborative or research-heavy sessions. Reviewers highlighted an intuitive, human-aligned tone and improved consistency in complex workflows.
Performance Benchmarks
- o3 makes 20% fewer major errors than o1 on difficult tasks.
- o3 achieved 69.1% on SWE-Bench Verified (software engineering).
- o3 scored 82.9% on MMMU (visual reasoning).
- o4-mini scored 92.7% on AIME 2025; 99.5% with Python access.
- Visual reasoning:
  - V*: 95.7% (visual search).
  - MathVista and CharXiv: state-of-the-art results in visual math reasoning and scientific chart interpretation.
  - “VLMs are Blind”: strong results on this perceptual-robustness benchmark.
- Both models outperformed o1 and o3-mini across GPQA, tau-bench, and cost-performance benchmarks.
Use Case Differentiation
- o3: Suited for advanced research in biology, engineering, and consulting. Strong in creative ideation, image interpretation, and dynamic tool orchestration.
- o4-mini: Balances performance and cost-efficiency. Designed for high-throughput reasoning tasks in coding, math, and general analytics. Offers higher usage limits and lower costs per request than o3-mini.
Developer Access & Codex CLI
Developers can access the models via Chat Completions and Responses APIs. Codex CLI, a lightweight terminal interface, enables local multimodal reasoning on files, terminal outputs, and images. A $1 million fund—offering $25,000 API credit grants—is active to encourage open-source development using Codex CLI.
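A minimal first call through the Responses API might look like the sketch below; the prompt is illustrative, and output_text is the Python SDK’s convenience accessor for the concatenated text output.

```python
# Minimal Responses API call against o4-mini.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="o4-mini",
    input="Summarize the tradeoffs between o3 and o4-mini in three bullets.",
)
print(response.output_text)
```

Codex CLI is installed separately as an open-source terminal tool (published to npm as @openai/codex at launch) and uses the same API key.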
Expanded Reasoning Example
In algebraic testing, o3 constructed a valid degree-19 polynomial using Dickson polynomial theory. It correctly verified the required constraints (monic, odd, correct coefficients) and computed p(19) = 1,876,572,071,974,094,803,391,179. A second construction returned 1,934,999,285,524,070,399,999,639. These examples reflect o3’s capacity for symbolic reasoning and mathematical synthesis without relying on external tools.
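The constraint-checking step is mechanical and easy to reproduce independently. The sketch below verifies monicity, oddness, and degree, then evaluates p(19) exactly using Python’s arbitrary-precision integers; the coefficients shown are placeholders, not the polynomial o3 actually produced.

```python
# Verify that a polynomial (given as {power: coefficient}) is monic, odd,
# and of degree 19, then evaluate it exactly at x = 19.
def check_and_evaluate(coeffs: dict[int, int], x: int) -> int:
    degree = max(k for k, c in coeffs.items() if c != 0)
    assert degree == 19, "expected degree 19"
    assert coeffs[degree] == 1, "expected a monic polynomial"
    # An odd polynomial has nonzero coefficients only on odd powers.
    assert all(k % 2 == 1 for k, c in coeffs.items() if c != 0), \
        "expected odd powers only"
    return sum(c * x**k for k, c in coeffs.items())

# Placeholder polynomial: p(x) = x^19 - 3x^5 + 7x
print(check_and_evaluate({19: 1, 5: -3, 1: 7}, 19))
```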
Reinforcement Learning Scaling
Both models benefited from a 10× increase in reinforcement-learning training compute over o1, and OpenAI reports that performance continues to improve with both added training compute and added inference-time reasoning: letting a model “think longer” yields gains, particularly on open-ended, high-effort queries. RL-based training also teaches the models to reason about when and how to use tools, not just how to execute them.
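In the API, this “think longer” dial is exposed as a reasoning-effort setting. A minimal sketch using the documented reasoning_effort parameter for o-series models; higher effort spends more inference-time compute (and latency and cost) on internal reasoning:

```python
# Run the same prompt at two reasoning-effort levels for comparison.
from openai import OpenAI

client = OpenAI()

for effort in ("low", "high"):
    response = client.chat.completions.create(
        model="o4-mini",
        reasoning_effort=effort,  # "low", "medium", or "high"
        messages=[{"role": "user",
                   "content": "Prove that the sum of two odd integers is even."}],
    )
    print(f"--- effort={effort} ---")
    print(response.choices[0].message.content)
```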
Safety Enhancements
OpenAI rebuilt refusal training datasets covering jailbreaks, malware, and biorisks. A dedicated reasoning monitor flags ~99% of dangerous red-teaming prompts. Evaluations under OpenAI’s Preparedness Framework confirmed o3 and o4-mini remain below “High” risk thresholds across biological, cybersecurity, and AI self-improvement dimensions.
Limitations Acknowledged
Despite improved reasoning, hallucination rates have increased. On PersonQA, o3 hallucinated in 33% of responses, roughly double o1’s rate of 16%, and o4-mini hallucinated in 48%. Errors include fabricated code execution, broken web links, and invented actions. Reinforcement learning appears to make the models more verbose and exploratory, increasing the number of claims they make overall and thereby boosting both accurate statements and confident fabrications. OpenAI admits it does not fully understand the uptick and is actively investigating mitigation strategies. For critical use cases, human verification remains essential.
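One lightweight verification layer, given the broken-link failure mode described above, is to check that any URLs a model cites actually resolve before trusting them. A minimal sketch using the requests library; the sample text is illustrative:

```python
# Extract URLs from a model response and confirm each one resolves.
# A hallucinated citation typically 404s or fails to connect.
import re
import requests

def verify_links(text: str) -> dict[str, bool]:
    urls = re.findall(r"https?://[^\s)\]>\"']+", text)
    results = {}
    for url in urls:
        try:
            # Some servers reject HEAD; a stricter check could fall back to GET.
            r = requests.head(url, allow_redirects=True, timeout=5)
            results[url] = r.status_code < 400
        except requests.RequestException:
            results[url] = False
    return results

sample = "See https://example.com/report and https://example.com/missing-page"
for url, ok in verify_links(sample).items():
    print(("OK   " if ok else "DEAD ") + url)
```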
Naming Confusion & Criticism
Critics flagged the confusing naming: despite the higher version number, o4-mini is the smaller model, and o3 remains the more powerful of the two. The independent lab Transluce observed o3 hallucinating about its own behavior, for example claiming to have run code in environments it cannot actually access. While o3 ranks high in usability for creative tasks, stakeholders raised concerns about trustworthiness in professional and regulated settings.
Business and Competitive Implications
OpenAI positions o3 and o4-mini as “a step change” toward agentic AI, per President Greg Brockman. They merge the reasoning strength of the o-series with GPT-like fluency. In business settings, CFOs report increasing ROI from generative AI, with 91% expressing high or full trust in output when grounded in company data. Nonetheless, 29% remain concerned about insight quality. OpenAI claims o3 and o4-mini offer superior ROI and lower costs than o1 and o3-mini—further boosting enterprise appeal.
Future Plans
OpenAI is preparing to release o3-pro, a version with enhanced reasoning and full tool access for Pro-tier users. This model aims to unify conversational fluency and dynamic tool use, serving as a precursor to autonomous AI agents that proactively solve user tasks.
Why This Matters:
o3 and o4-mini represent a significant evolution in general-purpose AI—fusing agentic reasoning, tool-based autonomy, and multimodal intelligence. These models can interpret messy real-world input, coordinate complex toolchains, and deliver high-utility outputs across disciplines. They signal a shift toward AI that not only answers but executes. Yet their elevated hallucination rates pose reliability risks in law, medicine, and critical infrastructure. For organizations, the promise of cost-effective automation is real—but only if paired with transparent use and oversight.