OpenAI's o3 Scores 136 on Mensa Norway IQ Test, Surpassing 98% of Humans
OpenAI’s latest language model, “o3,” has achieved a significant milestone by scoring 136 on the Mensa Norway intelligence test. This score places the model above approximately 98 percent of the human population, according to the standardized bell-curve IQ distribution used in the benchmarking. The reported figure is not a single result but a rolling average of the model’s seven most recent test runs.
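TrackingAI.org has not published the aggregation code behind that figure, but the described seven-run rolling average amounts to a simple windowed mean. A minimal sketch, with hypothetical per-run scores chosen purely for illustration:

```python
from collections import deque

def rolling_iq(scores, window=7):
    """Mean of the most recent `window` test runs (seven, per the methodology)."""
    recent = deque(scores, maxlen=window)  # keeps only the last `window` entries
    return sum(recent) / len(recent)

# Hypothetical per-run scores for illustration only; per-run data was not released
runs = [134, 138, 135, 137, 136, 135, 137]
print(round(rolling_iq(runs)))  # 136
```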
The “o3” model is part of the “o-series” of large language models, which have consistently ranked at the top of various cognitive evaluations. The model was evaluated using two benchmark formats, a proprietary “Offline Test” curated by TrackingAI.org and the publicly available Mensa Norway test, both scored against a human mean of 100. While “o3” scored 116 on the Offline evaluation, it reached 136 on the Mensa test, a 20-point gap. The discrepancy suggests either that the model is better suited to the Mensa test’s structure or that data-related confounds, such as prior exposure to the publicly available items during training, inflated the public-test result.
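Neither benchmark documents how raw item counts are mapped onto the IQ-style scale. A common approach is a deviation-score conversion against human norms (mean 100, SD 15); the sketch below assumes that method, with made-up human norm figures used only to show the arithmetic:

```python
def to_iq_scale(raw_score, human_mean, human_sd, iq_mean=100, iq_sd=15):
    """Deviation-IQ conversion: z-score the raw result against human norms,
    then rescale to mean 100 / SD 15. The benchmarks' actual norming is unpublished."""
    z = (raw_score - human_mean) / human_sd
    return iq_mean + iq_sd * z

# Hypothetical norms: assume humans average 62/100 correct (SD 14) on the Offline Test
print(round(to_iq_scale(77, human_mean=62, human_sd=14)))  # 116
```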
The Offline Test included 100 pattern-recognition questions designed to avoid any content that might have appeared in the data used to train AI models. Both assessments reported each model’s result as an average across the seven most recent completions. However, no standard deviations or confidence intervals were released alongside the final scores, limiting reproducibility and interpretability. The absence of methodological transparency, particularly around prompting strategies and scoring-scale conversion, further complicates the analysis.
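Because only averaged scores were published, readers cannot judge run-to-run variability. Had per-run results been released, a dispersion summary would be straightforward to compute; a sketch using the same hypothetical run scores as above:

```python
import statistics

# Hypothetical per-run scores; actual per-run data was not released
runs = [134, 138, 135, 137, 136, 135, 137]

mean = statistics.mean(runs)
sd = statistics.stdev(runs)          # sample standard deviation
t_crit = 2.447                       # two-sided 95% t critical value, df = 6
half_width = t_crit * sd / len(runs) ** 0.5

print(f"mean={mean:.1f}  sd={sd:.2f}  95% CI=({mean - half_width:.1f}, {mean + half_width:.1f})")
```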
TrackingAI.org, the independent platform that compiled the data, uses a standardized prompt format designed to ensure broad AI compliance while minimizing interpretive ambiguity. Each language model is presented with a statement followed by four Likert-style response options and is instructed to select one while justifying its choice. Responses must be clearly formatted, and if a model refuses to answer, the prompt is repeated up to ten times. The most recent successful response is then recorded for scoring purposes, with refusal events noted separately. This methodology aims to provide consistency in comparative assessments while documenting non-responsiveness as a data point in itself.
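The harness itself is not public, but the described retry-and-log behavior can be sketched as a simple loop. `query_model` below is a placeholder callable standing in for whatever API client TrackingAI.org actually uses:

```python
def run_item(query_model, prompt, max_attempts=10):
    """Repeat the prompt until a clearly formatted choice is returned,
    up to `max_attempts` times, logging refusals separately.
    Mirrors the described methodology only; the real harness is unpublished."""
    options = {"A", "B", "C", "D"}       # the four Likert-style response options
    refusals = 0

    for _ in range(max_attempts):
        reply = query_model(prompt).strip()
        choice = reply[:1].upper()       # expect the selected option letter up front
        if choice in options:
            # Successful, clearly formatted response: recorded for scoring
            return {"answer": choice, "refusals": refusals}
        refusals += 1                    # refusal noted separately as a data point

    return {"answer": None, "refusals": refusals}  # full non-response after ten tries
```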
The Mensa Norway test highlighted the gap between the top models and the rest of the field, with “o3” leading at 136. Other popular models scored considerably lower: GPT-4o landed at 95 on Mensa and 64 on the Offline test. Among open-source submissions, Meta’s Llama 4 Maverick ranked highest, posting 106 on Mensa and 97 on the Offline benchmark. Most Apache-licensed entries fell within the 60–90 range, reinforcing the current limitations of community-built architectures relative to corporate-backed research pipelines.
Multimodal models, which accept image input, consistently underperformed their text-only counterparts. For instance, OpenAI’s “o1 Pro” scored 107 on the Offline test in its text configuration but dropped to 97 in its vision-enabled version. The gap was wider on the Mensa test, where the text-only variant reached 122 compared to 86 for the visual version. This suggests that some multimodal pretraining methods introduce reasoning inefficiencies that remain unresolved. “o3,” which also analyzes and interprets images to a high standard, appears to break this trend.
IQ benchmarks provide a narrow window into a model’s reasoning capability, with short-context pattern matching offering only limited insights into broader cognitive behavior such as multi-turn reasoning, planning, or factual accuracy. Additionally, machine test-taking conditions, such as instant access to full prompts and unlimited processing speed, further blur comparisons to human cognition. The degree to which high IQ scores on structured tests translate to real-world language model performance remains uncertain.
Third-party evaluators such as LM-Eval, GPTZero, and MLCommons are increasingly relied upon as model developers continue to limit disclosures about internal architectures and training methods, and these “shadow evaluations” are shaping the emerging norms of large language model testing. OpenAI’s o-series holds a commanding position in these evaluations, though the long-term implications for general intelligence, agentic behavior, or ethical deployment remain to be addressed in more domain-relevant trials. The IQ scores, while provocative, serve more as signals of short-context proficiency than as definitive indicators of broader capability.
Per TrackingAI.org, additional analysis of format-based performance spreads and evaluation reliability will be needed to establish the validity of current benchmarks. With model releases accelerating and independent testing growing more sophisticated, comparative metrics are likely to keep evolving in both format and interpretation. For now, the finding reinforces the pattern of closed-source, proprietary models outperforming their open-source counterparts in controlled cognitive evaluations.
