Debates over AI benchmarks, and how they are reported by AI labs, are spilling out into public view.
This week, an OpenAI employee accused Elon Musk's AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of xAI's co-founders, Igor Babushkin, insisted that the company was in the right.
The truth lies somewhere in between.
In a post on xAI's blog, the company published a graph showing Grok 3's performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME's validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model's math ability.
xAI's graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI's best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI's graph didn't include o3-mini-high's AIME 2025 score at “cons@64.”
What is cons@64, you might ask? It's short for “consensus@64”: the model gets 64 tries at each problem in a benchmark, and the answer it generates most frequently is taken as its final answer. As you can imagine, cons@64 tends to boost a model's benchmark score quite a bit, and omitting it from a graph can make it appear as though one model surpasses another when that isn't the case.
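To make the difference between the two scoring modes concrete, here is a minimal Python sketch of majority-vote (“consensus”) scoring next to first-attempt (“@1”) scoring. It is an illustration only, not the evaluation code of xAI or OpenAI: the function names, the tiny three-problem answer key, and the sampled answers are all made up, and five attempts per problem stand in for 64.

```python
from collections import Counter

def consensus_answer(samples):
    """Return the answer that appears most often among the sampled attempts."""
    return Counter(samples).most_common(1)[0][0]

def score_at_1(attempts_per_problem, answer_key):
    """Fraction of problems whose FIRST sampled attempt is correct."""
    correct = sum(
        attempts[0] == key
        for attempts, key in zip(attempts_per_problem, answer_key)
    )
    return correct / len(answer_key)

def score_cons_at_k(attempts_per_problem, answer_key, k):
    """Fraction of problems whose majority answer over the first k attempts is correct."""
    correct = sum(
        consensus_answer(attempts[:k]) == key
        for attempts, key in zip(attempts_per_problem, answer_key)
    )
    return correct / len(answer_key)

# Hypothetical data: three problems, five sampled answers each (invented values).
answer_key = [204, 17, 588]
attempts = [
    [204, 204, 113, 204, 204],  # first attempt right, majority right
    [25, 17, 17, 82, 17],       # first attempt wrong, majority right
    [40, 588, 40, 40, 16],      # first attempt wrong, majority wrong
]

at_1 = score_at_1(attempts, answer_key)
cons_at_5 = score_cons_at_k(attempts, answer_key, k=5)
print(f"@1: {at_1:.2f}  cons@5: {cons_at_5:.2f}")
```

In this toy data the same model scores higher under consensus voting than on its first attempt, which is why a chart that mixes the two modes across models is hard to read fairly.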
Grok 3 Reasoning Beta's and Grok 3 mini Reasoning's scores for AIME 2025 at “@1”, meaning the first score the models got on the benchmark, fall below o3-mini-high's score. Grok 3 Reasoning Beta also trails slightly behind OpenAI's o1 model set to “medium” computing. Yet xAI is advertising Grok 3 as the “world's smartest AI.”
Babushkin argued on X that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more “accurate” graph showing nearly every model's performance at cons@64:
Hilarious how some people see my plot as attack on OpenAI and others as attack on Grok while in reality it's DeepSeek propaganda
(I actually believe Grok looks good there, and OpenAI's TTC chicanery behind o3-mini-*high*-pass@"""1""" deserves more scrutiny.) https://t.co/Djqljpcjh8 pic.twitter.com/3wh8foufic
Teortaxes▶️ (@teortaxesTex) February 20, 2025
But as Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models' limitations, and their strengths.