Democratizing access to natural language processing (NLP) technology is crucial, especially for underrepresented and extremely low-resource languages. Previous research has focused on developing labeled and unlabeled corpora for these languages through online scraping and document translation. While these methods have proven effective and cost-efficient, we identify limitations in the resulting corpora, including a lack of lexical diversity and of cultural relevance to local communities. To address this gap, we conduct a case study on Indonesian local languages, comparing the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content. In addition, we present the NusaWrites benchmark, encompassing 12 underrepresented and extremely low-resource languages spoken by millions of individuals in Indonesia. Our experiments with existing multilingual large language models show the need to extend these models to more underrepresented languages. We release the NusaWrites dataset at https://github.com/IndoNLP/nusa-writes.