OpenAI's o3 Scores 136 on Mensa Norway IQ Test, Surpassing 98% of Humans

Coin World · Thursday, Apr 17, 2025, 3:08 pm ET · 2 min read

OpenAI’s latest language model, “o3,” has scored 136 on the Mensa Norway intelligence test, a result that places it above approximately 98 percent of the human population according to the standardized bell-curve IQ distribution used in the benchmarking. The score was calculated as a rolling average over the model’s seven most recent test runs, which smooths out run-to-run variance.
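For context, the mapping from an IQ score to a population percentile follows directly from the normal distribution. TrackingAI.org has not published the exact scale parameters it uses, so the minimal sketch below assumes the conventional IQ convention of mean 100 and standard deviation 15:

```python
from statistics import NormalDist

# Conventional IQ scale: mean 100, SD 15. This is an assumption; the
# benchmark's exact scale parameters were not disclosed.
iq_scale = NormalDist(mu=100, sigma=15)

score = 136
percentile = iq_scale.cdf(score)  # fraction of the population expected to score below 136
print(f"IQ {score} is above about {percentile:.1%} of the population")
```

Under these assumptions, a 136 lands above roughly 99 percent of test-takers, which is consistent with the article’s “above approximately 98 percent” framing.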

The “o3” model is part of the “o-series” of large language models, which have consistently ranked at the top of various cognitive evaluations. The model was evaluated in two benchmark formats, both scored against a human mean of 100: a proprietary “Offline Test” curated by TrackingAI.org and the publicly available Mensa Norway test. “o3” scored 116 on the Offline evaluation but 136 on the Mensa test, a 20-point gap. The discrepancy suggests the model may be better suited to the Mensa test’s structure, or that data-related confounds such as prompt familiarity inflated the Mensa result.

The Offline Test included 100 pattern-recognition questions designed to avoid any content that might have appeared in the data used to train AI models. Both assessments reported each model’s result as an average across the seven most recent completions. However, no standard deviation or confidence intervals were released alongside the final scores, limiting reproducibility and interpretability. The absence of methodological transparency, particularly around prompting strategies and scoring scale conversion, further complicates the analysis.
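To illustrate what the missing dispersion statistics would add, here is a minimal sketch of reporting a seven-run mean alongside a standard deviation and an approximate 95% confidence interval. The run scores are invented for illustration; the actual per-run results were never released:

```python
from math import sqrt
from statistics import mean, stdev

runs = [131, 138, 136, 134, 139, 135, 139]  # hypothetical per-run scores (not released)

m = mean(runs)
sd = stdev(runs)                    # sample standard deviation across runs
sem = sd / sqrt(len(runs))          # standard error of the mean
# Normal approximation; with n=7 a t-based interval (t ~ 2.45) would be somewhat wider.
lo, hi = m - 1.96 * sem, m + 1.96 * sem

print(f"mean {m:.1f}, sd {sd:.1f}, 95% CI [{lo:.1f}, {hi:.1f}]")
```

Publishing even this much alongside the headline numbers would let readers judge whether the 20-point spread between the two formats exceeds ordinary run-to-run noise.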

TrackingAI.org, the independent platform that compiled the data, uses a standardized prompt format designed to ensure broad AI compliance while minimizing interpretive ambiguity. Each language model is presented with a statement followed by four Likert-style response options and is instructed to select one while justifying its choice. Responses must be clearly formatted, and if a model refuses to answer, the prompt is repeated up to ten times. The most recent successful response is then recorded for scoring purposes, with refusal events noted separately. This methodology aims to provide consistency in comparative assessments while documenting non-responsiveness as a data point in itself.
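Schematically, the administration loop described above looks like the sketch below. The `query_model` callable and the `parse_choice` helper are hypothetical stand-ins, not TrackingAI.org’s actual tooling; only the repeat-up-to-ten-times and refusal-logging behavior comes from the published description:

```python
MAX_ATTEMPTS = 10  # per the described protocol, a refused prompt is repeated up to ten times

def parse_choice(reply: str, options: list[str]) -> str | None:
    """Return the selected option if the reply clearly names exactly one, else None."""
    hits = [opt for opt in options if opt.lower() in reply.lower()]
    return hits[0] if len(hits) == 1 else None

def administer_item(query_model, statement: str, options: list[str]):
    """Present one statement with four response options; record the answer and refusal count."""
    refusals = 0
    for _ in range(MAX_ATTEMPTS):
        reply = query_model(statement, options)   # hypothetical API call to the model
        choice = parse_choice(reply, options)
        if choice is not None:
            return choice, refusals               # most recent successful response is scored
        refusals += 1                             # refusal is noted separately as its own data point
    return None, refusals                         # model never complied within ten attempts
```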

The Mensa Norway test highlighted the performance gap among top models, with “o3” leading the pack at 136. In contrast, other popular models scored considerably lower: GPT-4o landed at 95 on Mensa and 64 on the Offline test. Among open-source submissions, Meta’s Llama 4 Maverick ranked highest, posting 106 on Mensa and 97 on the Offline benchmark. Most Apache-licensed entries fell within the 60–90 range, reinforcing the current limitations of community-built architectures relative to corporate-backed research pipelines.

Multimodal models, which accept image inputs, consistently underperformed their text-only versions. For instance, OpenAI’s “o1 Pro” scored 107 on the Offline test in its text configuration but dropped to 97 in its vision-enabled version. The discrepancy was more pronounced on the Mensa test, where the text-only variant achieved 122 versus 86 for the visual version. This suggests that some multimodal pretraining methods introduce reasoning inefficiencies that have yet to be resolved. “o3,” however, breaks this trend: it also analyzes and interprets images to a very high standard.

IQ benchmarks provide a narrow window into a model’s reasoning capability, with short-context pattern matching offering only limited insights into broader cognitive behavior such as multi-turn reasoning, planning, or factual accuracy. Additionally, machine test-taking conditions, such as instant access to full prompts and unlimited processing speed, further blur comparisons to human cognition. The degree to which high IQ scores on structured tests translate to real-world language model performance remains uncertain.

Organizations such as LM-Eval, GPTZero, and MLCommons are increasingly relied upon for third-party assessments as model developers continue to limit disclosures about internal architectures and training methods. These “shadow evaluations” are shaping the emerging norms of large language model testing, especially in light of the opaque and often fragmented disclosures from leading AI firms. OpenAI’s o-series holds a commanding position in this testing workflow, though the long-term implications for general intelligence, agentic behavior, or ethical deployment remain to be addressed in more domain-relevant trials. The IQ scores, while provocative, serve more as signals of short-context proficiency than as definitive indicators of broader capability.

Per TrackingAI.org, additional analysis of format-based performance spreads and evaluation reliability will be needed to clarify the validity of current benchmarks. With model releases accelerating and independent testing growing in sophistication, comparative metrics are likely to keep evolving in both format and interpretation. For now, the finding reinforces the pattern of closed-source, proprietary models outperforming open-source counterparts in controlled cognitive evaluations.

Comments

Local-Store-491 (04/17): GPT-4o needs to step up its game, lol

Analytic_mindset1993 (04/17): @Local-Store-491 GPT-4o just needs to HODL and maybe it'll moon in the next benchmark, lol.

FTCommoner (04/17): Closed-source models crushing open-source ones is a trend. Wonder if we'll see a breakout open-source model soon?

Similar_Panda7299 (04/17): @FTCommoner Yeah, open-source lagging, maybe soon tho?

meowmeowmrcow (04/17): o3's IQ score is straight fire, not surprising tho

CaseEnvironmental824 (04/17): GPT-4o scoring low on Mensa? Maybe OpenAI has another ace up its sleeve with o3 leading the pack. Competition pushing innovation hard.

acg7 (04/17): Real-world relevance unclear, more testing needed, imo

twiggs462 (04/17): Llama 4 Maverick showing open-source spirit, respect.

TheOSU87 (04/17): Multimodal models got work to do, still alpha 🤔

notbutterface (04/17): Closed-source models crushing open-source ones in cognitive tests? Might be time for open-source devs to rethink strategies and catch up.

Holiday_Context5033 (04/17): o3's IQ score is mind-blowing, but let's not forget, it's still just a tool. AI hype train, full steam ahead! 🚂

DumbStocker (04/17): o3's IQ score is mind-blowing, but let's not forget, it's still a language model with limitations. 🤔

solidpaddy74 (04/17): @DumbStocker True, it's a model, not human.

lookingforfinaltix (04/17): Closed-source models dominate, but will it last? 🤔

Miguel_Legacy (04/17): Llama 4 Maverick showing promise, but the GAP between corporate-backed and open-source models is real. Can open-source catch the torch?

AlmightyAntwan12 (04/17): @Miguel_Legacy Open-source has potential, but corporate backing fuels scale and speed.

ConstructionOk6948 (04/17): OpenAI's o-series dominating these tests, but what does it mean for general intelligence or ethical deployment? More domain-relevant trials needed.

Accomplished-Bill-45 (04/17): Wow! The block option data in MSTF stock saved me much money!