OpenAI’s New o3 and o4‑Mini Models Push for Reasoning‑First AI

OpenAI has released two new large language models designed to “think” more deliberately, bringing multimodal problem‑solving and fuller tool autonomy to millions of ChatGPT users. The pair comprises o3, billed as the company’s most capable reasoning engine, and the smaller, faster o4‑Mini; both went live on 16 April for Plus, Pro and Team subscribers, with enterprise and education tiers to follow shortly.
The road to release
OpenAI originally planned to fold o3 into its forthcoming GPT‑5 release, but reversed course after internal testing suggested the model was ready for public use. The last‑minute pivot saw o3 and o4‑Mini launched as a pair, while GPT‑5 was pushed back “a few months”. The move leaves the firm juggling GPT‑4.1, GPT‑4o and the o‑series side by side, prompting renewed calls from developers for clearer product branding.
Both models integrate pictures directly into their chain of thought, enabling them to zoom, rotate or annotate images as part of the reasoning process, a capability OpenAI markets as “thinking with images”. They can also decide, step‑by‑step, when to search the web, execute Python, interpret user‑supplied files or generate fresh graphics. The update nudges ChatGPT closer to an “agentic” workflow in which the assistant plans and executes multi‑stage tasks with minimal user guidance.
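At the API level, this kind of tool autonomy follows the familiar function‑calling pattern: the developer declares the available tools, and the model decides at each step whether to invoke one or answer directly. The sketch below illustrates the idea using the official OpenAI Python SDK; the run_python tool name and schema are hypothetical stand‑ins rather than OpenAI built‑ins, and the snippet omits the loop that would feed tool results back to the model.

```python
# Sketch of letting a reasoning model decide when to call a tool.
# Assumes the official OpenAI Python SDK; the tool definition is illustrative.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Declare a hypothetical Python-execution tool the model may choose to invoke.
tools = [{
    "type": "function",
    "function": {
        "name": "run_python",  # illustrative name, not an OpenAI built-in
        "description": "Execute a short Python snippet and return its stdout.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    },
}]

response = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": "What is the 20th Fibonacci number?"}],
    tools=tools,  # the model decides, step by step, whether to call run_python
)

message = response.choices[0].message
if message.tool_calls:  # the model opted to use the tool rather than answer directly
    call = message.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
else:
    print(message.content)
```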
Benchmarks: performance without brute force
OpenAI claims o3 sets state‑of‑the‑art results on Codeforces, SWE‑bench and MMMU, while o4‑Mini achieves a 99.5 % pass@1 score on the AIME 2025 mathematics exam when allowed to call Python. Independent hands‑on tests by TechRadar found o3 strongest at analytical prompts and o4‑Mini nearly as accurate but noticeably faster and cheaper to run. Early reviewers also highlighted both models’ calm explanation style, contrasting with GPT‑4o’s more conversational tone.
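For context, pass@1 is the probability that a single sampled answer solves the problem; benchmark reports typically estimate it with the unbiased pass@k formula from the HumanEval paper (Chen et al., 2021). A minimal illustration of that estimator, not OpenAI’s evaluation code:

```python
# Unbiased pass@k estimator (Chen et al., 2021, HumanEval paper).
# n = samples drawn per problem, c = samples that passed, k = attempt budget.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (out of n drawn) is correct."""
    if n - c < k:
        return 1.0  # too few failures to fill k attempts: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 9 correct -> pass@1 estimate of 0.9.
print(pass_at_k(n=10, c=9, k=1))  # 0.9
```

With k = 1 the estimator reduces to c / n, the fraction of sampled answers that pass.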
For ChatGPT Plus subscribers, o3 is capped at 50 prompts per week, whereas o4‑Mini offers 150 messages per day; upgrading to the US$200‑per‑month Pro plan raises those limits to “near‑unlimited” use, subject to abuse checks. OpenAI says the smaller model is priced “on par with GPT‑4o” in the API, positioning it as a high‑throughput option for organisations that can trade absolute peak accuracy for speed and cost efficiency.
Safety ledger and hallucination trade‑offs
Despite a new preparedness framework, internal tests show o3 hallucinating on 33 % of PersonQA questions, roughly double the rate of its predecessor o1, while o4‑Mini fares worse at 48 %. Researchers attribute the rise to reinforcement‑learning tweaks that encourage longer, more detailed answers, which raises the probability of both correct and incorrect claims. OpenAI acknowledged the issue, stating that further research is “ongoing”.

Developers have welcomed the models’ richer tool orchestration, and some sizeable software firms are already piloting o4‑Mini for customer‑support chatbots where response speed matters. Yet several financial‑services companies contacted by Axios said they would wait for third‑party audits of hallucination rates before extending production use.
Conclusion
With GPT‑5 still on the horizon, o3 serves as OpenAI’s flagship for agentic reasoning, while o4‑Mini offers a pragmatic blend of speed, cost and capability. The launch cements the o‑series as a distinct line alongside the broader GPT family, but also underlines a central tension in current AI research: deeper reasoning does not automatically deliver higher factual reliability.