The proliferation of offensive language online calls for effective detection mechanisms, especially in multilingual contexts. This study addresses that need by introducing new datasets for offensive language detection in three major Nigerian languages: Hausa, Yoruba, and Igbo. We collected data from Twitter and had native speakers manually annotate it, producing a separate dataset for each language. We then evaluated pre-trained language models on these datasets to assess how well they detect offensive language. The best-performing model achieved an accuracy of 90%. To support further research in offensive language detection, we plan to make the datasets and our models publicly available.