Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity of any continent. These include 75 languages with at least one million speakers each. Yet little NLP research has been conducted on African languages. High-quality annotated datasets are crucial to enabling such research. In this paper, we introduce AfriSenti, a collection of 14 sentiment datasets comprising more than 110,000 tweets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yorùbá) from four language families, annotated by native speakers. The data is used in SemEval 2023 Task 12, the first Afro-centric SemEval shared task. We describe the data collection methodology, the annotation process, and the challenges encountered in curating each dataset. We conduct experiments with different sentiment classification baselines and discuss their usefulness. We hope AfriSenti enables new work on under-represented languages. The dataset is available at https://github.com/afrisenti-semeval/afrisent-semeval-2023 and can also be loaded as a Hugging Face dataset (https://huggingface.co/datasets/shmuhammad/AfriSenti).