Africa has over 2,000 indigenous languages, but they are under-represented in NLP research due to a lack of datasets. In recent years, there has been progress in developing labeled corpora for African languages. However, these corpora are often available in a single domain and may not generalize to other domains. In this paper, we focus on the task of sentiment classification for cross-domain adaptation. We create a new dataset, NollySenti, based on Nollywood movie reviews in five languages widely spoken in Nigeria (English, Hausa, Igbo, Nigerian-Pidgin, and Yoruba). We provide an extensive empirical evaluation using classical machine learning methods and pre-trained language models. Leveraging transfer learning, we compare the performance of cross-domain adaptation from the Twitter domain and cross-lingual adaptation from English. Our evaluation shows that transfer from English in the same target domain leads to more than 5% improvement in accuracy compared to transfer from Twitter in the same language. To further mitigate the domain difference, we leverage machine translation (MT) from English to other Nigerian languages, which leads to a further improvement of 7% over cross-lingual evaluation. While MT to low-resource languages is often of low quality, we show through human evaluation that most of the translated sentences preserve the sentiment of the original English reviews.
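
To make the cross-domain setting concrete, the following is a minimal sketch of how an in-domain baseline and a cross-domain transfer baseline could be compared. It is an illustration only, not the experimental setup of this paper: a small multinomial Naive Bayes classifier stands in for the classical machine learning methods, and the names `train_nb`, `predict`, `accuracy`, `twitter_train`, `movie_train`, and `movie_test` are hypothetical, as are the toy English sentences, which are not drawn from NollySenti or any Twitter corpus.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption made for brevity)."""
    return text.lower().split()


def train_nb(examples):
    """Fit a multinomial Naive Bayes model on (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return {"labels": label_counts, "words": word_counts, "vocab": vocab}


def predict(model, text):
    """Return the label with the highest log-posterior under the model."""
    total = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n in model["labels"].items():
        counts = model["words"][label]
        denom = sum(counts.values()) + vocab_size  # add-one smoothing
        score = math.log(n / total)
        for tok in tokenize(text):
            if tok in model["vocab"]:  # ignore words never seen in training
                score += math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, examples):
    """Fraction of (text, label) pairs that the model labels correctly."""
    correct = sum(predict(model, text) == label for text, label in examples)
    return correct / len(examples)


# Hypothetical toy data: a source domain (tweets) and a target domain (reviews).
twitter_train = [
    ("i love this phone so much", "positive"),
    ("great service today thank you", "positive"),
    ("this network is terrible again", "negative"),
    ("i hate the traffic this morning", "negative"),
]
movie_train = [
    ("a great movie with a brilliant cast", "positive"),
    ("i love the storyline and the acting", "positive"),
    ("the plot is boring and predictable", "negative"),
    ("terrible acting and a weak script", "negative"),
]
movie_test = [
    ("brilliant acting and a great storyline", "positive"),
    ("boring script and a predictable plot", "negative"),
]

# In-domain: train and test on movie reviews.
in_domain_acc = accuracy(train_nb(movie_train), movie_test)
# Cross-domain: train on tweets, test on movie reviews in the same language.
cross_domain_acc = accuracy(train_nb(twitter_train), movie_test)

print(f"in-domain accuracy:    {in_domain_acc:.2f}")
print(f"cross-domain accuracy: {cross_domain_acc:.2f}")
```

The same two-model comparison extends to the cross-lingual setting by swapping the source training set for one in another language; with a bag-of-words model this only works when vocabularies overlap, which is one reason multilingual pre-trained language models are the natural choice there.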