Africa is home to over 2,000 languages from over six language families and has the highest linguistic diversity of any continent. This includes 75 languages with at least one million speakers each. Yet little NLP research has been conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti, a collection of 14 sentiment datasets comprising more than 110,000 tweets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yorùbá) from four language families, annotated by native speakers. The data is used in SemEval 2023 Task 12, the first Afro-centric SemEval shared task. We describe the data collection methodology, the annotation process, and the challenges encountered in curating each dataset. We conduct experiments with different sentiment classification baselines and discuss their usefulness. We hope AfriSenti enables new work on under-represented languages. The dataset is available at https://github.com/afrisenti-semeval/afrisent-semeval-2023 and can also be loaded as a Hugging Face dataset (https://huggingface.co/datasets/shmuhammad/AfriSenti).