OpenAI's new reasoning model, 'o3', is making waves with its ability to fact-check itself and adjust its reasoning time, outperforming previous models in various benchmarks.

However, its release is delayed due to safety concerns and the need for a new alignment technique to mitigate deception risks.

Despite the delay and high costs, the model's impressive performance hints at a step towards artificial general intelligence.

o3 models are in test phase at the moment

OpenAI unveils 'o3' reasoning models with groundbreaking results in benchmarks

By Akash Pandey 11:34 am Dec 21, 2024

What's the story

OpenAI unveiled its latest reasoning model, o3, on the last day of its 12-day "shipmas" event. The new model is an improved version of the previously released o1 "reasoning" model. Like o1, o3 is a model family: there is o3 itself and a compact version called o3-mini, which has been fine-tuned for specific tasks.

AGI progress

o3: A step toward AGI?

OpenAI hints that o3 could be a step toward artificial general intelligence (AGI) under certain conditions. However, this comes with major caveats. The company's CEO Sam Altman has previously expressed his preference for a federal testing framework to guide the monitoring and mitigation of risks associated with new reasoning models like o3.

Information

Naming controversy: Why 'o3' and not 'o2'

The name o3 was chosen instead of o2 reportedly due to trademark issues. To avoid any possible conflict with British telecom provider O2, OpenAI opted for this naming strategy. Altman confirmed this during a recent livestream.

Release plan

Availability and safety testing

As of now, neither o3 nor o3-mini is widely available. However, safety researchers can sign up for a preview of the o3-mini model. Altman said OpenAI plans to release o3-mini by the end of January and follow it up with the launch of the full o3 model.

Risk mitigation

Concerns and new alignment technique

AI safety testers have discovered that o1's reasoning capabilities could make it deceive human users more often than traditional models. To counter these issues, OpenAI is using a new technique called "deliberative alignment" to ensure models like o3 align with its safety principles. The success of this technique in reducing deception risks will be determined when testing results are released by OpenAI's red-team partners.

Model capabilities

Unique features of o3 models

Unlike most AI, reasoning models like o3 can fact-check themselves, which helps them avoid pitfalls that often trip up other models. However, this self-checking means they take longer to reach solutions than regular non-reasoning models. Despite this latency, reasoning models are more reliable in fields like science, physics, and mathematics. The key difference between o3 and o1 is the ability to "adjust" reasoning time (i.e., thinking time) to a low, medium, or high compute setting. Higher compute settings allow o3 to perform better on tasks.

Benchmark scores

o3 scored 87.5% in high compute setting

One benchmark suggests OpenAI is edging closer to AGI. On ARC-AGI, a test assessing an AI system's ability to learn new skills beyond its training data, o3 scored 87.5% in the high compute setting. Even in the low compute setting, the model tripled o1's performance. Admittedly, the high compute setting was incredibly costly, running into thousands of dollars per challenge, according to ARC-AGI co-creator François Chollet.

More

o3 blows away competition in Epoch AI's FrontierMath benchmark

o3 also outperformed o1 by 22.8 percentage points on SWE-Bench Verified and achieved a Codeforces rating of 2727. It scored 96.7% on the 2024 American Invitational Mathematics Examination, missing just one question, and 87.7% on GPQA Diamond, a test comprising graduate-level biology, physics, and chemistry questions. Finally, o3 set a new record on Epoch AI's FrontierMath benchmark, solving 25.2% of problems, a feat unmatched by any other model, as none has surpassed 2%.
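The math-benchmark figures above can be sanity-checked with a little arithmetic. The sketch below is illustrative only: it assumes the 2024 AIME score covers both sittings of the exam (30 questions in total), which the article does not state, and the helper name `pct` is hypothetical.

```python
def pct(correct: int, total: int) -> float:
    """Return the share of correct answers as a percentage, rounded to one decimal."""
    return round(100 * correct / total, 1)

# Assumption: 30 questions (two 15-question AIME papers); not stated in the article.
AIME_QUESTIONS = 30
aime_score = pct(AIME_QUESTIONS - 1, AIME_QUESTIONS)  # exactly one question missed

# FrontierMath: o3 solved 25.2% of problems; no other model surpassed 2%.
frontier_ratio = round(25.2 / 2.0, 1)  # lower bound on o3's lead over other models

print(aime_score)      # 96.7
print(frontier_ratio)  # 12.6
```

Under that assumption, one missed question out of 30 reproduces the reported 96.7%, and o3's FrontierMath result is at least about 12.6 times the best previous score.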
