Africa has over 2000 indigenous languages, but they are under-represented in NLP research due to a lack of datasets. In recent years, there has been progress in developing labeled corpora for African languages. However, these corpora are often available in a single domain and may not generalize to other domains. In this paper, we focus on the task of sentiment classification for cross-domain adaptation. We create a new dataset, NollySenti, based on Nollywood movie reviews for five languages widely spoken in Nigeria (English, Hausa, Igbo, Nigerian-Pidgin, and Yoruba). We provide an extensive empirical evaluation using classical machine learning methods and pre-trained language models. Leveraging transfer learning, we compare the performance of cross-domain adaptation from the Twitter domain and cross-lingual adaptation from English. Our evaluation shows that transfer from English in the same target domain leads to more than 5% improvement in accuracy compared to transfer from Twitter in the same language. To further mitigate the domain difference, we leverage machine translation (MT) from English to the other Nigerian languages, which leads to a further improvement of 7% over cross-lingual evaluation. While MT to low-resource languages is often of low quality, we show through human evaluation that most of the translated sentences preserve the sentiment of the original English reviews.