OpenAI’s o3-pro model now sets a new standard in advanced AI reasoning, delivering top-tier performance in science, math, and coding benchmarks, while significantly reducing API costs. Despite these gains, o3-pro still relies on simulated reasoning, which can lead to confident but incorrect outputs, highlighting the limitations of current AI reasoning paradigms.
OpenAI Launches o3-pro – Key Points
Launch Date and Availability:
o3-pro launched on June 10, 2025, and is now available for ChatGPT Pro and Team users, replacing o1-pro. Enterprise and Edu access begins the following week. The model is also accessible through the OpenAI developer API.
Pricing Update:
OpenAI cut o3-pro’s API prices by 87% compared to o1-pro, now charging $20 per million input tokens and $80 per million output tokens. The standard o3 model’s API pricing was also reduced by 80%.
- For reference:
- o1-pro was $150 input / $600 output per million tokens.
- o3-mini is $1.10 input / $4.40 output per million tokens.
- 1M input tokens ≈ 750,000 words (more than the full text of War and Peace).
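Using the published per-million-token rates above, per-request cost is simple arithmetic; a minimal sketch (function name and example token counts are illustrative):

```python
# Published API rates in USD per 1M tokens, from the figures above
RATES = {
    "o3-pro": {"input": 20.00, "output": 80.00},
    "o1-pro": {"input": 150.00, "output": 600.00},
    "o3-mini": {"input": 1.10, "output": 4.40},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request at the published rates."""
    r = RATES[model]
    return (input_tokens * r["input"] + output_tokens * r["output"]) / 1_000_000

# Example: a request with 10k input tokens and 2k output tokens
print(round(estimate_cost("o3-pro", 10_000, 2_000), 2))  # 0.36
print(round(estimate_cost("o1-pro", 10_000, 2_000), 2))  # 2.7
```

The same request that cost $2.70 on o1-pro now costs $0.36 on o3-pro, matching the advertised 87% reduction.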
Benchmark Superiority (Expanded):
OpenAI published detailed benchmark results showcasing o3-pro’s performance:
- AIME 2024 (Math): o3-pro: 93%, o3: 90%, o1-pro: 86%.
- GPQA Diamond (PhD-level Science): o3-pro: 84%, o3: 81%, o1-pro: 79%.
- Codeforces (Competitive Coding, Elo): o3-pro: 2748, o3: 2517, o1-pro: 1707.
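To put the Codeforces gap in perspective, the standard Elo expected-score formula (a common interpretation, not OpenAI's methodology) translates rating differences into head-to-head win probabilities:

```python
def elo_expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

# Ratings reported in the benchmarks above
print(round(elo_expected_score(2748, 2517), 2))  # o3-pro vs o3: ~0.79
print(round(elo_expected_score(2748, 1707), 3))  # o3-pro vs o1-pro: ~0.998
```

By this reading, o3-pro's 1041-point lead over o1-pro corresponds to winning essentially every head-to-head contest.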
4/4 Reliability Benchmarks:
In OpenAI's stricter "4/4 reliability" evaluation, where a model must answer the same question correctly four times in a row, o3-pro outperformed both o3 and o1-pro:
- AIME 2024: o3-pro – 90%, o3 – 80%, o1-pro – 80%.
- GPQA Diamond: o3-pro – 76%, o3 – 67%, o1-pro – 74%.
- Codeforces: o3-pro – 2301, o3 – 2011, o1-pro – 1423.
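OpenAI has not published per-problem scoring details, but a back-of-envelope check shows why the 4/4 numbers are notable: if each attempt succeeded independently with probability p, 4/4 accuracy would be p^4, well below what was reported, meaning successes cluster on the same problems rather than occurring at random:

```python
def four_of_four_if_independent(p: float) -> float:
    """4/4 consistency if each attempt succeeded independently with prob p."""
    return p ** 4

# AIME 2024: o3-pro scores 93% single-attempt but 90% on 4/4.
# Independence would predict only ~75%, so the model is consistently
# right (or consistently wrong) on the same problems.
print(round(four_of_four_if_independent(0.93), 3))  # 0.748
```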
Human Comparative Evaluations:
In blind evaluations, human reviewers preferred o3-pro over o3 in every tested category:
- Overall preference: 64% for o3-pro
- Scientific analysis: 64.9%
- Personal writing: 66.7%
- Programming: 62.7%
- Data analysis: 64.3%
Advanced Capabilities:
o3-pro integrates a wide set of tools and functions:
- Web search
- Python code execution
- Image and file analysis
- Memory-based personalization
- Uses chain-of-thought (CoT) processing for reasoning-like output.
Limitations and Technical Notes:
- Slower responses than o1-pro due to expanded tool use and token output.
- No image generation—for visual tasks, GPT-4o or o4-mini are recommended.
- Canvas (OpenAI’s workspace) is not supported.
- Temporary chats are currently disabled due to technical issues.
Simulated Reasoning: What It Actually Means:
Ars Technica and academic studies clarify that “reasoning” in o3-pro does not reflect true logical thinking:
- Simulated reasoning = more inference-time compute + chain-of-thought token planning.
- Outputs are pattern-matched from training data, not built from logical inference.
- Models can still confabulate (produce factual errors with confidence).
- o3-pro often “thinks out loud” in tokens, offering clearer intermediate steps—but this doesn’t mean it can self-correct or recognize logical contradictions.
Known Weaknesses from Research:
- Fails at novelty: Studies show performance collapses on scaled logic puzzles such as Tower of Hanoi once complexity grows beyond familiar patterns.
- Fails with contradiction: Continues flawed approaches even when output is illogical.
- Scaling paradox: Some models reduce reasoning effort as problems become more complex.
- Even when armed with known algorithms, models don’t reliably apply them.
Future Directions and Mitigation Strategies:
Researchers are working on:
- Self-consistency sampling: Generate multiple solution paths to check for agreement.
- Self-critique prompts: Encourage models to assess their own outputs.
- Tool augmentation: Use external symbolic math engines or calculators to improve output fidelity.
- These are early-stage fixes, not full solutions to the reasoning gap.
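Self-consistency sampling can be sketched model-agnostically: draw several independent solution paths and keep the majority answer. Here `sample_answer` is a hypothetical stand-in for any stochastic model call reduced to its final answer:

```python
import random
from collections import Counter
from typing import Callable

def self_consistency(sample_answer: Callable[[], str], n: int = 5) -> str:
    """Sample n independent solution paths and return the majority answer.

    `sample_answer` stands in for any stochastic model call, e.g. one
    chain-of-thought completion reduced to its final answer.
    """
    answers = [sample_answer() for _ in range(n)]
    answer, _count = Counter(answers).most_common(1)[0]
    return answer

# Toy stand-in: a "model" that answers "42" three times out of four
random.seed(0)
noisy_model = lambda: random.choice(["42", "42", "42", "41"])
print(self_consistency(noisy_model, n=7))
```

Majority voting suppresses occasional wrong paths, but it cannot rescue a model that is systematically wrong on a problem class, which is why these remain partial mitigations.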
Model Philosophy & User Use Case:
As The Neuron’s Corey Noles puts it:
“o3‑Pro isn’t your everyday chat buddy—it’s the brainiac you summon when accuracy trumps speed.”
o3-pro is suited for technical tasks, analysis, and problem-solving where clarity and logic structure matter more than quick response or friendly tone.
Safety and Transparency:
o3-pro shares its safety and interpretability documentation with o3. Full disclosure is provided via the OpenAI o3 system card.
Why This Matters:
OpenAI’s o3-pro offers improved accuracy, lower pricing, and advanced tool use, marking it as a reliable choice for math, coding, and science tasks. However, the model’s reasoning is still simulated—not logical—and remains prone to patterned confabulation under novel conditions. While a strong tool for structured problems, it should be used with caution in high-stakes or unpredictable environments.