Hate speech detection is a crucial task, especially on social media, where harmful content can spread quickly. Collecting social media content (such as tweets) to train machine learning models is easy, but detecting and categorizing hate speech is difficult because the task is inherently subjective. This subjectivity leads to frequent disagreement among annotators, particularly for subtle or borderline content. Traditional approaches either discard non-consensus samples or force a "gold standard" through expert adjudication, ignoring valuable information about uncertainty and diverse human perspectives. We examine the largely overlooked problem of annotator disagreement in hate speech classification. We evaluate a range of aggregation methods, including majority voting and ordinal strategies (minimum, maximum, and mean), and analyze their impact across binary, 4-class, and 6-class classification tasks. In addition, we leverage annotators' perceived hate speech strength scores to explore regression-based and hybrid modeling approaches. Among other findings, we show that filtering out non-consensus samples yields over-optimistic results and that perceived strength provides a complementary signal that enhances classification performance. Finally, we establish new state-of-the-art results for hate speech detection in Turkish tweets and demonstrate that annotator disagreement, when properly modeled, is a valuable resource for building more robust and reliable systems.
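The aggregation methods named above (majority voting and the ordinal minimum, maximum, and mean strategies) can be illustrated with a minimal sketch. It assumes that each annotator assigns an integer label on an ordinal scale where a larger value means stronger hate speech. The function name `aggregate`, the use of `None` to flag a tied majority vote, and the half-up rounding of the mean are illustrative assumptions, not details taken from the paper.

```python
import math
from collections import Counter


def aggregate(labels, strategy):
    """Combine one sample's annotator labels into a single label.

    Assumption: `labels` is a non-empty list of integers on an ordinal
    scale (larger value = stronger hate speech). This scale is
    hypothetical, not the paper's annotation scheme.
    """
    if strategy == "majority":
        counts = Counter(labels)
        top = max(counts.values())
        winners = [label for label, n in counts.items() if n == top]
        # Assumption: a tie is reported as None (a non-consensus sample).
        return winners[0] if len(winners) == 1 else None
    if strategy == "min":
        # Most lenient annotator decides.
        return min(labels)
    if strategy == "max":
        # Most severe annotator decides.
        return max(labels)
    if strategy == "mean":
        # Assumption: round the average half-up to the nearest class.
        return math.floor(sum(labels) / len(labels) + 0.5)
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    votes = [0, 2, 3]  # three annotators disagree on one tweet
    for name in ("majority", "min", "max", "mean"):
        print(name, aggregate(votes, name))
```

For the disagreeing votes `[0, 2, 3]`, majority voting finds no winner, while the ordinal strategies each still produce a label. This is one way to see why discarding non-consensus samples removes exactly the borderline cases.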