AInvest Newsletter
Daily stocks & crypto headlines, free to your inbox
Leading AI chatbots, while proficient in English, face significant challenges when operating in other languages because of cultural and linguistic nuances. Experts highlight that large language models (LLMs) often lack contextual understanding of local cultures, leading to misinterpretations or ineffective responses. Kalika Bali, a senior principal researcher at Microsoft Research India, described such models as “a bit like a Fulbright scholar who is interested in Asia as their area of study”: they possess knowledge but lack the cultural depth of an insider’s perspective [1]. For instance, a math problem involving “key lime pie” fails to resonate in regions like India, where the concept is unfamiliar [2].
The disparity stems from the imbalance in digital content availability. Roughly half of global web content is in English, creating a rich training dataset for English-language models. For other languages, however, developers must resort to labor-intensive methods, such as manually curating high-quality data from open-source repositories. Kasima Tharnpipitchai, head of AI strategy at SCB 10X, detailed how the Thai LLM Typhoon was developed by native speakers sifting through datasets to distinguish quality content, a process he termed “brute force” [3]. Despite these efforts, non-English models still lag: Typhoon outperformed GPT-3.5 in Thai, a result Tharnpipitchai attributed more to GPT-3.5’s poor performance than to Typhoon’s superiority [3].
Legal and cultural hurdles further complicate data collection. Malaysian startup Mesolitica, which is developing a Malay LLM, reported that data owners have asked for their content to be removed from open-source training datasets, shrinking the already limited pool of high-quality material. Khalil Nooh, Mesolitica’s CEO, emphasized the need to collaborate with private dataset owners to address this gap [3]. Synthetic data generation, in which English content is translated into other languages, is emerging as a solution. Nooh highlighted its use in capturing Malaysia’s 16 dialects, which lack a substantial digital presence [3].
Cultural sensitivities pose additional barriers. In India, Bali noted that even tech-positive communities avoid sharing information on certain topics during data collection. Similarly, Malaysia’s “three Rs” (race, religion, and royalty) are deemed sensitive and require careful handling. SCB 10X addressed this by developing a “safety model” for public sector use in Thailand, alongside its standard Typhoon model [3].
Experts agree that overcoming these challenges requires sustained effort in data curation, synthetic generation, and cultural awareness. As global demand for multilingual AI grows, the ability to navigate these complexities will determine the success of non-English LLMs.
Source:
[1] [title:AI chatbots struggle to function beyond English: ‘They know a lot…but they miss the culture’] [url:https://fortune.com/asia/2025/07/25/ai-chatbots-malay-india-thailand-english-llm-brainstorm-ai-singapore/]
[2] [title:AI chatbots struggle to function beyond English: ‘They know a lot…but they miss the culture’] [url:https://fortune.com/asia/2025/07/25/ai-chatbots-malay-india-thailand-english-llm-brainstorm-ai-singapore/]
[3] [title:AI chatbots struggle to function beyond English: ‘They know a lot…but they miss the culture’] [url:https://fortune.com/asia/2025/07/25/ai-chatbots-malay-india-thailand-english-llm-brainstorm-ai-singapore/]
