AI Performance Benchmarks Challenge Companies in High-Risk Industries
Companies are increasingly looking to deploy AI systems that outperform humans. Measuring AI performance against human benchmarks, however, is not straightforward. At a roundtable discussion at Saïd Business School, University of Oxford, participants noted that most companies treat existing human performance as the bar for AI deployment. One news agency, for instance, has committed to using AI only where it achieves a lower average error rate than humans on the same task, such as translating news stories into foreign languages.
However, this benchmark is not always appropriate. For example, an oil giant wanted to use a large language model (LLM) as a decision-support system for its safety and reliability engineers. The LLM scored 92% on a safety engineering exam, well above the pass mark and better than the average human grade. Yet the 8% of questions it missed raised concerns: the team could not determine why the model got them wrong, and that opacity undermined confidence in deploying it in a domain where mistakes can be catastrophic.
This issue is not unique to the oil industry. In healthcare, AI systems that read medical scans are often assessed by comparing their average performance with that of human radiologists. But overall error rates can hide what matters most: an AI system might be better on average yet more likely to miss the most aggressive cancers. In such cases, performance on the subset of the most consequential decisions matters more than average performance; the sketch below illustrates the gap with invented numbers.
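To make the distinction concrete, here is a minimal Python sketch. All of the error rates and the 5% prevalence figure are invented for illustration, not drawn from any real study: they simply show how a system can beat a human on average while losing on the critical subset.

```python
import random

# Invented scenario: simulate scan reads where the AI has a lower overall
# error rate than the human reader but a higher error rate on the small
# subset of aggressive cancers. All probabilities below are assumptions.
random.seed(0)

cases = []  # each case: (is_aggressive, ai_correct, human_correct)
for _ in range(100_000):
    aggressive = random.random() < 0.05        # assume 5% of scans are aggressive
    if aggressive:
        ai_ok = random.random() < 0.80         # assumed AI error here: 20%
        human_ok = random.random() < 0.90      # assumed human error here: 10%
    else:
        ai_ok = random.random() < 0.98         # assumed AI error elsewhere: 2%
        human_ok = random.random() < 0.93      # assumed human error elsewhere: 7%
    cases.append((aggressive, ai_ok, human_ok))

def error_rate(correct_flags):
    """Fraction of reads that were wrong."""
    return 1 - sum(correct_flags) / len(correct_flags)

print("Overall error      AI:", f"{error_rate([ai for _, ai, _ in cases]):.1%}",
      " human:", f"{error_rate([h for _, _, h in cases]):.1%}")

aggressive_cases = [(ai, h) for a, ai, h in cases if a]
print("Aggressive subset  AI:", f"{error_rate([ai for ai, _ in aggressive_cases]):.1%}",
      " human:", f"{error_rate([h for _, h in aggressive_cases]):.1%}")
```

Under these assumed rates, the AI's overall error (about 2.9%) is less than half the human's (about 7.2%), yet on the aggressive subset the AI errs roughly twice as often. A benchmark that averages over all cases would never surface that gap.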
This challenge is particularly acute in high-risk domains. Companies want AI systems that are superhuman in decision-making and human-like in their reasoning. However, current AI methods struggle to achieve both simultaneously. AI systems are often described as "aliens" because they excel at specific tasks but do not understand or think like humans. A recent research paper illustrated this point by showing that AI reasoning models can be significantly impaired by seemingly irrelevant phrases, highlighting the unpredictability of AI behavior.
The societal acceptance of AI's alien nature depends on the domain. Self-driving cars, for instance, could reduce road accidents significantly, but their mistakes are often alien and unpredictable. Society values control, predictability, and the illusion of perfectibility, and is therefore uncomfortable with systems that operate beyond human understanding. This unease extends to other high-risk areas, where the consequences of AI mistakes can be severe.
In conclusion, while companies strive for AI systems that outperform humans, the complexity of measuring and ensuring reliable AI performance remains a significant challenge. The nuanced nature of AI decision-making and the societal discomfort with unpredictable systems highlight the need for careful consideration and regulation in AI deployment. 