SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James V. Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno P. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Frederikus Hudi, Railey Montalan, Ryan Ignatius, Joanito Agili Lopo, William Nixon, Börje F. Karlsson, James Jaya, Ryandito Diandaru, Yuze Gao, Patrick Amadeus, Bin Wang, Jan Christian Blaise Cruz, Chenxi Whitehouse, Ivan Halim Parmonangan, Maria Khelli, Wenyu Zhang, Lucky Susanto, Reynard Adha Ryanda, Sonny Lazuardi Hermawan, Dan John Velasco, Muhammad Dehan Al Kautsar, Willy Fitra Hendria, Yasmin Moslem, Noah Flynn, Muhammad Farid Adilazuarda, Haochen Li, Johanes Lee, R. Damanhuri, Shuo Sun, Muhammad Reza Qorib, Amirbek Djanibekov, Wei Qi Leong, Quyet V. Do, Niklas Muennighoff, Tanrada Pansuwan, Ilham Firdausi Putra, Yan Xu, Ngee Chia Tai, Ayu Purwarianti, Sebastian Ruder, William Tjhi, Peerat Limkonchotiwat, Alham Fikri Aji, Sedrick Keh, Genta Indra Winata, Ruochen Zhang, Fajri Koto, Zheng-Xin Yong, Samuel Cahyawijaya

From arXiv; https://github.com/SEACrowd

Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, text, image, and audio datasets from SEA are significantly underrepresented in prevailing AI models, which compromises the quality of those models for SEA languages. Evaluating models for SEA languages is challenging because high-quality datasets are scarce, and the dominance of English training data raises concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that builds a comprehensive resource hub, filling the resource gap with standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA.