OpenAI's o3 Scores 136 on Mensa Norway IQ Test, Surpassing 98% of Humans
OpenAI’s latest language model, “o3,” has achieved a significant milestone by scoring 136 on the Mensa Norway intelligence test. This score places the model above approximately 98 percent of the human population, according to the standardized bell-curve IQ distribution used in the benchmarking. The reported figure is not a single result but a rolling average of the model’s seven most recent test runs.
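TrackingAI.org has not published the aggregation code behind that figure, but the described seven-run rolling average amounts to a simple windowed mean. A minimal sketch, with hypothetical per-run scores chosen purely for illustration:

```python
from collections import deque

def rolling_iq(scores, window=7):
    """Mean of the most recent `window` test runs (seven, per the methodology)."""
    recent = deque(scores, maxlen=window)  # keeps only the last `window` entries
    return sum(recent) / len(recent)

# Hypothetical per-run scores for illustration only; per-run data was not released
runs = [134, 138, 135, 137, 136, 135, 137]
print(round(rolling_iq(runs)))  # 136
```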
The “o3” model is part of the “o-series” of large language models, which have consistently ranked at the top of various cognitive evaluations. The model was evaluated using two benchmark formats, a proprietary “Offline Test” curated by TrackingAI.org and the publicly available Mensa Norway test, both scored against a human mean of 100. While “o3” scored 116 on the Offline evaluation, it reached 136 on the Mensa test, a 20-point gap. The discrepancy suggests either that the model is better suited to the Mensa test’s structure or that data-related confounds, such as prior exposure to the publicly available items during training, inflated the public-test result.
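Neither benchmark documents how raw item counts are mapped onto the IQ-style scale. A common approach is a deviation-score conversion against human norms (mean 100, SD 15); the sketch below assumes that method, with made-up human norm figures used only to show the arithmetic:

```python
def to_iq_scale(raw_score, human_mean, human_sd, iq_mean=100, iq_sd=15):
    """Deviation-IQ conversion: z-score the raw result against human norms,
    then rescale to mean 100 / SD 15. The benchmarks' actual norming is unpublished."""
    z = (raw_score - human_mean) / human_sd
    return iq_mean + iq_sd * z

# Hypothetical norms: assume humans average 62/100 correct (SD 14) on the Offline Test
print(round(to_iq_scale(77, human_mean=62, human_sd=14)))  # 116
```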
The Offline Test included 100 pattern-recognition questions designed to avoid any content that might have appeared in the data used to train AI models. Both assessments reported each model’s result as an average across the seven most recent completions. However, no standard deviations or confidence intervals were released alongside the final scores, limiting reproducibility and interpretability. The absence of methodological transparency, particularly around prompting strategies and scoring-scale conversion, further complicates the analysis.
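Because only averaged scores were published, readers cannot judge run-to-run variability. Had per-run results been released, a dispersion summary would be straightforward to compute; a sketch using the same hypothetical run scores as above:

```python
import statistics

# Hypothetical per-run scores; actual per-run data was not released
runs = [134, 138, 135, 137, 136, 135, 137]

mean = statistics.mean(runs)
sd = statistics.stdev(runs)          # sample standard deviation
t_crit = 2.447                       # two-sided 95% t critical value, df = 6
half_width = t_crit * sd / len(runs) ** 0.5

print(f"mean={mean:.1f}  sd={sd:.2f}  95% CI=({mean - half_width:.1f}, {mean + half_width:.1f})")
```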
TrackingAI.org, the independent platform that compiled the data, uses a standardized prompt format designed to ensure broad AI compliance while minimizing interpretive ambiguity. Each language model is presented with a statement followed by four Likert-style response options and is instructed to select one while justifying its choice. Responses must be clearly formatted, and if a model refuses to answer, the prompt is repeated up to ten times. The most recent successful response is then recorded for scoring purposes, with refusal events noted separately. This methodology aims to provide consistency in comparative assessments while documenting non-responsiveness as a data point in itself.
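The harness itself is not public, but the described retry-and-log behavior can be sketched as a simple loop. `query_model` below is a placeholder callable standing in for whatever API client TrackingAI.org actually uses:

```python
def run_item(query_model, prompt, max_attempts=10):
    """Repeat the prompt until a clearly formatted choice is returned,
    up to `max_attempts` times, logging refusals separately.
    Mirrors the described methodology only; the real harness is unpublished."""
    options = {"A", "B", "C", "D"}       # the four Likert-style response options
    refusals = 0

    for _ in range(max_attempts):
        reply = query_model(prompt).strip()
        choice = reply[:1].upper()       # expect the selected option letter up front
        if choice in options:
            # Successful, clearly formatted response: recorded for scoring
            return {"answer": choice, "refusals": refusals}
        refusals += 1                    # refusal noted separately as a data point

    return {"answer": None, "refusals": refusals}  # full non-response after ten tries
```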
The Mensa Norway test highlighted the gap between the top models and the rest of the field, with “o3” leading at 136. Other popular models scored considerably lower: GPT-4o landed at 95 on Mensa and 64 on the Offline test. Among open-source submissions, Meta’s Llama 4 Maverick ranked highest, posting 106 on Mensa and 97 on the Offline benchmark. Most Apache-licensed entries fell within the 60–90 range, reinforcing the current limitations of community-built architectures relative to corporate-backed research pipelines.
Multimodal models, which accept image input, consistently underperformed their text-only counterparts. For instance, OpenAI’s “o1 Pro” scored 107 on the Offline test in its text configuration but dropped to 97 in its vision-enabled version. The gap was wider on the Mensa test, where the text-only variant reached 122 compared to 86 for the visual version. This suggests that some multimodal pretraining methods introduce reasoning inefficiencies that remain unresolved. “o3,” which also analyzes and interprets images to a high standard, appears to break this trend.
IQ benchmarks provide a narrow window into a model’s reasoning capability, with short-context pattern matching offering only limited insights into broader cognitive behavior such as multi-turn reasoning, planning, or factual accuracy. Additionally, machine test-taking conditions, such as instant access to full prompts and unlimited processing speed, further blur comparisons to human cognition. The degree to which high IQ scores on structured tests translate to real-world language model performance remains uncertain.
Third-party evaluators such as LM-Eval, GPTZero, and MLCommons are increasingly relied upon as model developers continue to limit disclosures about internal architectures and training methods, and these “shadow evaluations” are shaping the emerging norms of large language model testing. OpenAI’s o-series holds a commanding position in these evaluations, though the long-term implications for general intelligence, agentic behavior, or ethical deployment remain to be addressed in more domain-relevant trials. The IQ scores, while provocative, serve more as signals of short-context proficiency than as definitive indicators of broader capability.
Per TrackingAI.org, additional analysis of format-based performance spreads and evaluation reliability will be needed to establish the validity of current benchmarks. With model releases accelerating and independent testing growing more sophisticated, comparative metrics are likely to keep evolving in both format and interpretation. For now, the finding reinforces the pattern of closed-source, proprietary models outperforming their open-source counterparts in controlled cognitive evaluations.
