The automatic detection of hate speech online is an active research area in NLP. Most studies to date rely on social media datasets, and hate speech detection models are trained on that data. However, data creation processes carry their own biases, and models inherently learn these dataset-specific biases. In this paper, we perform a large-scale cross-dataset comparison in which we fine-tune language models on different hate speech detection datasets. This analysis shows that some datasets are more generalisable than others when used as training data. Crucially, our experiments show that combining hate speech detection datasets can contribute to the development of robust hate speech detection models. This robustness holds even when controlling for data size and when compared with the best individual datasets.