NollySenti: Leveraging Transfer Learning and Machine Translation for Nigerian Movie Sentiment Classification

Africa has over 2000 indigenous languages, but they are under-represented in NLP research due to a lack of datasets. In recent years, there has been progress in developing labeled corpora for African languages. However, they are often available in a single domain and may not generalize to other domains. In this paper, we focus on the task of sentiment classification for cross-domain adaptation. We create a new dataset, NollySenti, based on Nollywood movie reviews for five languages widely spoken in Nigeria (English, Hausa, Igbo, Nigerian-Pidgin, and Yoruba). We provide an extensive empirical evaluation using classical machine learning methods and pre-trained language models. Leveraging transfer learning, we compare the performance of cross-domain adaptation from the Twitter domain and cross-lingual adaptation from English. Our evaluation shows that transfer from English in the same target domain leads to more than 5% improvement in accuracy compared to transfer from Twitter in the same language. To further mitigate the domain difference, we leverage machine translation (MT) from English to other Nigerian languages, which leads to a further improvement of 7% over cross-lingual evaluation. While MT to low-resource languages is often of low quality, through human evaluation, we show that most of the translated sentences preserve the sentiment of the original English reviews.

