OpenAI

For the finale of its 12 Days of OpenAI livestream event, CEO Sam Altman revealed the company’s next foundation model, the successor to the recently announced o1 family of reasoning AI, dubbed o3 and o3-mini.

And no, you aren’t going crazy: OpenAI skipped right over o2, apparently to avoid infringing on the trademark of British telecom provider O2.

Sam Altman describing the o3 model’s capabilities


While the new o3 models are not being released to the public just yet and there’s no word on when they’ll be incorporated into ChatGPT, they are now available for testing by safety and security researchers.

o3, our latest reasoning model, is a breakthrough, with a step function improvement on our hardest benchmarks. we are starting safety testing & red teaming now. https://t.co/4XlK1iHxFK

Greg Brockman (@gdb), December 20, 2024

The o3 family, like the o1 family before it, operates differently from traditional generative models in that its models internally fact-check their responses before presenting them to the user. While this technique slows the model’s response time by anywhere from a few seconds to a few minutes, its answers to complex science, math, and coding queries tend to be more accurate and reliable than what you’d get from GPT-4. Additionally, the model is able to transparently explain the reasoning by which it arrived at its result.

Users can also manually adjust the amount of time the model spends considering a problem by selecting between low, medium, and high compute, with the highest setting returning the most complete answers. That performance does not come cheap, mind you. Processing at high compute will reportedly cost thousands of dollars per task, ARC-AGI co-creator François Chollet wrote in an X post Friday.

Today OpenAI announced o3, its next-gen reasoning model. We’ve worked with OpenAI to test it on ARC-AGI, and we believe it represents a significant breakthrough in getting AI to adapt to novel tasks.

It scores 75.7% on the semi-private eval in low-compute mode (for $20 per task… pic.twitter.com/ESQ9CNVCEA

François Chollet (@fchollet), December 20, 2024

The new family of reasoning models reportedly offers significantly better performance than even o1, which debuted in September, on the industry’s most challenging benchmark tests. According to the company, o3 outperforms its predecessor by about 23 percentage points on the SWE-Bench Verified coding test and scores more than 60 points higher than o1 on Codeforces’ benchmark. The new model also scored an impressive 96.7% on the AIME 2024 mathematics test, missing just one question, and outperformed human experts on the GPQA Diamond, notching a score of 87.7%. Even more impressive, o3 reportedly solved more than a quarter of the problems presented on the EpochAI Frontier Math benchmark, where other models have struggled to correctly solve more than 2% of them.

OpenAI does note that the models it previewed on Friday are still early versions and that “final results may evolve with more post-training.” The company has additionally incorporated new “deliberative alignment” safety measures into o3’s training methodology. The o1 reasoning model has shown a troubling habit of trying to deceive human evaluators at a higher rate than conventional AI like GPT-4o, Gemini, or Claude; OpenAI believes that the new guardrails will help minimize those tendencies in o3.

Members of the research community interested in trying o3-mini for themselves can sign up for access on OpenAI’s waitlist.