Democratizing access to natural language processing (NLP) technology is crucial, especially for underrepresented and extremely low-resource languages. Previous research has focused on developing labeled and unlabeled corpora for these languages through online scraping and document translation. While these methods have proven effective and cost-efficient, we identify limitations in the resulting corpora, including a lack of lexical diversity and of cultural relevance to local communities. To address this gap, we conduct a case study on Indonesian local languages, comparing the effectiveness of online scraping, human translation, and paragraph writing by native speakers for constructing datasets. Our findings show that datasets produced through paragraph writing by native speakers are of higher quality in terms of lexical diversity and cultural content. In addition, we present the NusaWrites benchmark, which covers 12 underrepresented and extremely low-resource languages spoken by millions of people in Indonesia. Our experiments with existing multilingual large language models indicate that these models need to be extended to more underrepresented languages. We release the NusaWrites dataset at https://github.com/IndoNLP/nusa-writes.