xAI hired contractors to train Grok on coding tasks with the explicit goal of topping a popular AI leaderboard by beating Anthropic's Claude model. Documents obtained by Business Insider show xAI wanted Grok to outperform Claude on coding tasks. AI leaderboards have become a key battleground for labs chasing clout and investment.
xAI, Elon Musk's artificial intelligence company, has been engaged in a competitive push to improve its Grok model's standing on AI leaderboards. According to documents obtained by Business Insider, xAI hired contractors through Scale AI's Outlier platform to train Grok on coding tasks, aiming to outperform Anthropic's Claude model [1]. The effort highlights the growing importance of AI leaderboards as a battleground for attracting investment and media attention.
The focus on leaderboards is not new, but the intensity and tactics employed have evolved. For instance, OpenAI and Anthropic have been known to leverage leaderboard rankings to secure funding and lucrative contracts. The industry's reliance on these unofficial scoreboards has led to intense competition, with some companies employing questionable tactics to boost their rankings [1].
xAI's Grok 4 model, launched on July 9, 2025, was specifically trained to "beat Sonnet 3.7 Extended," a reference to Anthropic's Claude model. The project involved contractors refining front-end code for user interface prompts to improve Grok's performance on WebDev Arena, an influential leaderboard from LMArena [1]. While Grok 4 ranked in the top three for LMArena's core categories of math, coding, and "Hard Prompts," its performance varied across different leaderboards, indicating the challenges of benchmarking AI models.
The race for leaderboard dominance is not without its controversies. Meta's Llama 4, for example, faced accusations of gaming the rankings after a specially tuned variant was submitted for benchmarking that differed from the model actually released to the public [1]. xAI's project with Scale AI, while not shown to involve gaming the leaderboard, underscores how aggressively companies now compete on these scoreboards.
Despite the hype surrounding Grok 4, some users have expressed disappointment with its performance in real-world applications, particularly in coding tasks. This highlights the potential disconnect between leaderboard rankings and actual model capabilities [2]. The focus on leaderboard dominance can sometimes lead to models that excel in trivial exercises but falter when facing real-world challenges [1].
In addition to its leaderboard ambitions, xAI has also secured a contract with the U.S. Department of Defense to launch "Grok for Government," a specialized version of its chatbot platform tailored for federal use. This move comes as the Trump administration continues to spur federal AI adoption [3]. SpaceX, another Musk-owned company, has committed $2 billion to xAI's expansion, further signaling Musk's investment in the AI business [3].
In conclusion, xAI's push to move Grok up the AI leaderboards is a strategic bid to strengthen its competitive standing in the industry. While the project has drawn criticism and controversy, it also underscores how central leaderboards have become as a metric for attracting investment and media attention. As the AI industry continues to evolve, the pursuit of leaderboard dominance is likely to remain a significant force shaping the competitive landscape.
References:
[1] https://www.businessinsider.com/grok-leaderboard-coding-anthropic-claude-scale-ai-2025-7
[2] https://www.reddit.com/r/singularity/comments/1lyzqzg/grok_4_disappointment_is_evidence_that_benchmarks/
[3] https://cryptorank.io/news/feed/4f030-elon-musks-xai-launches-grok-for-government