Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of text, image, and audio datasets from SEA, which compromises the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, which raises concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that consolidates a comprehensive resource hub to fill the resource gap, providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA.