Despite attempts to make Large Language Models multilingual, many of the world's languages remain severely under-resourced. This widens the performance gap between NLP and AI applications aimed at well-financed languages and those aimed at less-resourced ones. In this paper, we focus on Nigerian Pidgin (NP), which is spoken by nearly 100 million people but has comparatively few NLP resources and corpora. We address the task of Implicit Discourse Relation Classification (IDRC) and systematically compare two approaches: (i) translating NP data to English, applying a well-resourced English IDRC tool, and back-projecting the labels; and (ii) creating a synthetic discourse corpus for NP by translating the PDTB and projecting its labels, then training a "native" NP IDR classifier. The latter approach outperforms our baseline by 13.27\% and 33.98\% in f$_{1}$ score for 4-way and 11-way classification, respectively.