AI Models Struggle with Non-English Languages Due to Data Scarcity

The world's leading AI models are predominantly trained in English, which poses significant challenges for languages with less digital content, such as Cantonese. Jacky Chan, the founder of Votee, highlights the difficulty in translating terms like "dim sum" into English, as machines struggle to differentiate between loanwords and terms that require direct translation. This issue is part of a broader problem where AI models, while proficient in English, lag behind in other languages due to limited data availability.
AI models rely heavily on data from books, articles, and websites to learn and generate text. In English, there is an abundance of digital resources, but many other languages do not have the same luxury. This disparity leads to performance gaps, where non-English AI models are more likely to produce inaccurate or nonsensical outputs.
Low-resource languages, which include widely spoken languages like Cantonese, Vietnamese, and Bahasa Indonesia, face additional challenges. Limited internet access and government regulations can hinder the creation and availability of digital content in these languages. For instance, Indonesia's regulations can lead to self-censorship, resulting in data that may not accurately represent local culture.
The performance gap is exacerbated by the complexity of languages that do not use Latin script or that have tonal features, which are harder for standard text encodings and tokenizers to represent efficiently. Currently, the best-performing AI models are in English and, to a lesser extent, Mandarin Chinese, reflecting the locations of major tech companies. However, efforts are being made to develop AI models tailored to specific countries' needs.
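As a rough illustration of one source of that difficulty: UTF-8 encodes Latin letters in a single byte each, while Chinese characters take three bytes each, so byte-level tokenizers start from a longer byte sequence for the same word and tend to fragment scripts that are rare in their training data into more tokens. This is a minimal sketch using only the standard library; the texts chosen are illustrative.

```python
# Sketch: UTF-8 byte cost of Latin script vs. Chinese characters.
# Byte-level tokenizers (e.g. byte-level BPE) operate on these bytes,
# so scripts needing more bytes per character tend to split into more
# tokens when training data in that script is scarce.

def utf8_stats(text):
    """Return (character count, UTF-8 byte count, bytes per character)."""
    raw = text.encode("utf-8")
    return len(text), len(raw), len(raw) / len(text)

english = "dim sum"    # romanised loanword, Latin script
cantonese = "點心"      # the same dish in Chinese characters

print(utf8_stats(english))    # (7, 7, 1.0)  -> 1 byte per character
print(utf8_stats(cantonese))  # (2, 6, 3.0)  -> 3 bytes per character
```

More bytes per character means fewer characters fit in a model's context window and, for under-represented scripts, more tokens per word, both of which work against low-resource languages.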
South Korean internet firm Naver has developed HyperCLOVA X, an AI model trained on significantly more Korean data than GPT-4. The company is also expanding into markets like Saudi Arabia and Thailand to create "sovereign AI." In Indonesia, telecom operator Indosat and tech startup Goto are collaborating to launch an AI model that operates in Bahasa Indonesia and five other local languages.
One of the main challenges in developing non-English AI models is achieving the scale of leading English and Chinese models, whose most powerful versions have hundreds of billions of parameters. For example, the Southeast Asian Languages in One Model (SEA-LION) project has trained models with 3 billion and 7 billion parameters, orders of magnitude smaller than the leading English and Chinese models.
Jacky Chan from Votee faces these challenges when developing AI models for Cantonese. The language has different grammar for formal writing compared to informal writing and speech, and available digital data is often low-quality. Training on such data is like "learning from a library with many books, but they have lots of typos, they are poorly translated, or they’re just plain wrong."
Without a comprehensive dataset, an AI model cannot produce reliable results. Data for low-resource languages often skews towards formal texts, which can distort an AI model's tone, vocabulary, and style. AI models have no inherent sense of what is true, so false or incomplete information will be reproduced as fact. A model trained solely on Vietnamese pop music, for instance, might struggle to accurately answer questions about historical events.
One solution to supplement limited training data is to translate English content into the target language. However, machine translation carries risks, such as missing linguistic nuances or cultural context. A study by Georgia Tech found that AI models trained on Arabic datasets exhibited Western bias due to machine-translated content from English.
In the long term, AI-generated content could pollute low-resource language datasets, leading to "model collapse" where AI-generated text contaminates the training data for future AI models. This threat is even greater for low-resource languages, where AI-generated content could quickly make up a larger share of what's online.
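The feedback loop behind model collapse can be sketched with a toy example: fit a trivial "model" (here just a Gaussian) to data, generate synthetic samples from it, then refit the next generation only on those samples, the way future models might ingest AI-generated web text. This is a minimal illustration, not the methodology of the model-collapse research; all names and numbers are made up.

```python
import random
import statistics

# Toy sketch of "model collapse": each generation is fit purely on the
# previous generation's synthetic output, so estimation error compounds
# and the fitted distribution drifts away from the original "human" data.

def run_generations(n_generations=20, sample_size=200, seed=0):
    rng = random.Random(seed)
    # Generation 0 trains on "human" data: a standard normal distribution.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    stds = []
    for _ in range(n_generations):
        mu = statistics.fmean(data)
        sigma = statistics.stdev(data)
        stds.append(sigma)
        # The next generation trains only on synthetic samples.
        data = [rng.gauss(mu, sigma) for _ in range(sample_size)]
    return stds

stds = run_generations()
print(f"generation 1 fitted std:  {stds[0]:.3f}")
print(f"generation 20 fitted std: {stds[-1]:.3f}")
```

With each generation, sampling error is baked into the next training set, so the fitted spread wanders away from the true value; with less data per generation, as in a low-resource language, the drift is faster.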
Large businesses are starting to recognize the opportunities in building non-English AI models. However, these companies are still much smaller than giants like Alibaba and OpenAI. More organizations, both for-profit and not-for-profit, need to invest in multilingual AI to make this technology truly global.
Aliya Bhatia, a senior policy analyst at the Center for Democracy & Technology, emphasizes the importance of multilingual AI in providing access to economic opportunities and educational resources. "If LLMs are going to be used to equip people with access to economic opportunities, educational resources, and more, they should work in the languages people use," she says.
