Hate speech detection (HSD) models are meant to address the global problem of hateful content proliferating on online platforms, yet they are typically developed on datasets collected in the United States and therefore fail to generalize to English dialects from the Majority World. Furthermore, HSD models are often evaluated on curated samples, raising concerns that their performance in real-world settings is overestimated. In this work, we introduce NaijaHate, the first dataset annotated for HSD that contains a representative sample of Nigerian tweets. We demonstrate that evaluating HSD models on the biased datasets traditionally used in the literature largely overestimates their real-world performance on representative data. We also propose NaijaXLM-T, a pretrained model tailored to the Nigerian Twitter context, and establish the key role played by domain-adaptive pretraining and finetuning in maximizing HSD performance. Finally, we show that in this context, a human-in-the-loop approach to content moderation in which humans review 1% of Nigerian tweets flagged as hateful would make it possible to moderate 60% of all hateful content. Taken together, these results pave the way towards robust HSD systems and better protection of social media users from hateful content in low-resource settings.
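The human-in-the-loop figure can be read as recall under a fixed review budget. The sketch below illustrates that reading only: it assumes tweets are ranked by a model's hatefulness score and that humans review the top-scoring fraction. The function name `recall_at_review_budget`, the ranking assumption, and the toy scores and labels are hypothetical and are not taken from NaijaHate or NaijaXLM-T.

```python
def recall_at_review_budget(scores, labels, budget):
    """Share of all hateful items caught when the top `budget` fraction
    of items, ranked by model score, is sent to human review.

    scores: model hatefulness scores (higher means more likely hateful)
    labels: ground-truth labels, 1 for hateful and 0 otherwise
    budget: fraction of all items humans can review, in (0, 1]
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not 0 < budget <= 1:
        raise ValueError("budget must be in (0, 1]")

    # Number of items reviewers can look at (at least one).
    n_review = max(1, int(len(scores) * budget))

    # Rank items from highest to lowest score and keep the reviewed slice.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    caught = sum(label for _, label in ranked[:n_review])

    total_hateful = sum(labels)
    return caught / total_hateful if total_hateful else 0.0


# Toy data (invented for illustration): 10 items, 3 of them hateful.
scores = [0.95, 0.90, 0.80, 0.40, 0.30, 0.20, 0.10, 0.05, 0.02, 0.01]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

# Fraction of hateful items caught when reviewing the top 20% by score.
print(recall_at_review_budget(scores, labels, budget=0.2))
```

Under this reading, the reported result corresponds to a recall of about 0.6 at a budget of 0.01. On representative data, where hateful content is rare, a small budget can only reach high recall if the model ranks hateful tweets near the top.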