Hate speech is a pervasive and harmful form of online discourse, often manifested through slurs and abuse ranging from hateful tweets to defamatory posts. As such speech proliferates, it links people worldwide while posing significant social, psychological, and occasionally physical threats to targeted individuals and communities. Current computational linguistic approaches to tackling this phenomenon rely on labelled social media datasets for training. To unify these efforts, our study addresses the critical need for a comprehensive meta-collection, advocating for an extensive dataset to help counteract the problem effectively. We scrutinized over 60 datasets and selectively integrated the pertinent ones into MetaHate. This paper offers a detailed examination of existing collections, highlighting their strengths and limitations. Our findings contribute to a deeper understanding of the existing datasets, paving the way for training more robust and adaptable models. Such models are essential for combating the dynamic and complex nature of hate speech in the digital realm.