OpenAI’s New o3 and o4‑Mini Models Push for Reasoning‑First AI

OpenAI has released two new large language models designed to “think” more deliberately, bringing multimodal problem‑solving and fuller tool autonomy to millions of ChatGPT users. The pair comprises o3, billed as the company’s most capable reasoning engine, and the smaller, faster o4‑Mini; both went live on 16 April for Plus, Pro and Team subscribers, with enterprise and education tiers to follow shortly.
The road to release
OpenAI originally planned to fold o3 into its forthcoming GPT‑5 release, but reversed course after internal testing suggested the model was ready for public use. The last‑minute pivot saw o3 and o4‑Mini launched as a pair, while GPT‑5 was pushed back “a few months”. The move leaves the firm juggling GPT‑4.1, GPT‑4o and the o‑series side by side, prompting renewed calls from developers for clearer product branding.
Both models integrate pictures directly into their chain of thought, enabling them to zoom, rotate or annotate images as part of the reasoning process, a capability OpenAI markets as “thinking with images”. They can also decide, step‑by‑step, when to search the web, execute Python, interpret user‑supplied files or generate fresh graphics. The update nudges ChatGPT closer to an “agentic” workflow in which the assistant plans and executes multi‑stage tasks with minimal user guidance.
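At the API level, this kind of tool autonomy follows the familiar function‑calling pattern: the developer declares the available tools, and the model decides at each step whether to invoke one or answer directly. The sketch below illustrates the idea using the official OpenAI Python SDK; the run_python tool name and schema are hypothetical stand‑ins rather than OpenAI built‑ins, and the snippet omits the loop that would feed tool results back to the model.

```python
# Sketch of letting a reasoning model decide when to call a tool.
# Assumes the official OpenAI Python SDK; the tool definition is illustrative.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Declare a hypothetical Python-execution tool the model may choose to invoke.
tools = [{
    "type": "function",
    "function": {
        "name": "run_python",  # illustrative name, not an OpenAI built-in
        "description": "Execute a short Python snippet and return its stdout.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    },
}]

response = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": "What is the 20th Fibonacci number?"}],
    tools=tools,  # the model decides, step by step, whether to call run_python
)

message = response.choices[0].message
if message.tool_calls:  # the model opted to use the tool rather than answer directly
    call = message.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
else:
    print(message.content)
```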
Benchmarks: performance without brute force
OpenAI claims o3 sets state‑of‑the‑art results on Codeforces, SWE‑bench and MMMU, while o4‑Mini achieves a 99.5 % pass@1 score on the AIME 2025 mathematics exam when allowed to call Python. Independent hands‑on tests by TechRadar found o3 strongest at analytical prompts and o4‑Mini nearly as accurate but noticeably faster and cheaper to run. Early reviewers also highlighted both models’ calm explanation style, contrasting with GPT‑4o’s more conversational tone.
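For context, pass@1 is the probability that a single sampled answer solves the problem; benchmark reports typically estimate it with the unbiased pass@k formula from the HumanEval paper (Chen et al., 2021). A minimal illustration of that estimator, not OpenAI’s evaluation code:

```python
# Unbiased pass@k estimator (Chen et al., 2021, HumanEval paper).
# n = samples drawn per problem, c = samples that passed, k = attempt budget.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (out of n drawn) is correct."""
    if n - c < k:
        return 1.0  # too few failures to fill k attempts: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 9 correct -> pass@1 estimate of 0.9.
print(pass_at_k(n=10, c=9, k=1))  # 0.9
```

With k = 1 the estimator reduces to c / n, the fraction of sampled answers that pass.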
For ChatGPT Plus subscribers, o3 is capped at 50 prompts per week, whereas o4‑Mini offers 150 messages per day; upgrading to the US$200‑per‑month Pro plan raises those limits to “near‑unlimited” use, subject to abuse checks. OpenAI says the smaller model is priced “on par with GPT‑4o” in the API, positioning it as a high‑throughput option for organisations that can trade absolute peak accuracy for speed and cost efficiency.
Safety ledger and hallucination trade‑offs
Despite a new preparedness framework, internal tests show o3 hallucinating on 33 % of PersonQA questions, roughly double the rate of its predecessor o1, while o4‑Mini fares worse at 48 %. Researchers attribute the rise to reinforcement‑learning tweaks that encourage longer, more detailed answers, which raises the probability of both correct and incorrect claims. OpenAI acknowledged the issue, stating that further research is “ongoing”.

Developers have welcomed the models’ richer tool orchestration, and some sizeable software firms are already piloting o4‑Mini for customer‑support chatbots where response speed matters. Yet several financial‑services companies contacted by Axios said they would wait for third‑party audits of hallucination rates before extending production use.
Conclusion
With GPT‑5 still on the horizon, o3 serves as OpenAI’s flagship for agentic reasoning, while o4‑Mini offers a pragmatic blend of speed, cost and capability. The launch cements the o‑series as a distinct line alongside the broader GPT family, but also underlines a central tension in current AI research: deeper reasoning does not automatically deliver higher factual reliability.