Though majority vote among annotators is typically used for ground truth labels in natural language processing, annotator disagreement in tasks such as hate speech detection may reflect differences in opinion across groups, not noise. Thus, a crucial problem in hate speech detection is determining whether a statement is offensive to the demographic group that it targets, when that group may constitute a small fraction of the annotator pool. We construct a model that predicts individual annotator ratings on potentially offensive text and combines this information with the predicted target group of the text to model the opinions of target group members. We show gains across a range of metrics, including raising performance over the baseline by 22% at predicting individual annotators' ratings and by 33% at predicting variance among annotators, which provides a metric for model uncertainty downstream. We find that annotator ratings can be predicted using their demographic information and opinions on online content, without the need to track identifying annotator IDs that link each annotator to their ratings. We also find that use of non-invasive survey questions on annotators' online experiences helps to maximize privacy and minimize unnecessary collection of demographic information when predicting annotators' opinions.
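To illustrate why majority vote can hide the view of the targeted group, the following minimal sketch uses hypothetical toy data. The group names, ratings, and helper functions are all assumptions for illustration and are not taken from the paper's model or dataset. It compares the majority-vote label with the mean rating of annotators from the target group, and computes the variance among annotators as a simple disagreement signal.

```python
from collections import Counter
from statistics import mean, pvariance

# Hypothetical toy data: (annotator_group, rating) pairs for one text,
# where rating 1 = "offensive" and 0 = "not offensive".
ratings = [
    ("group_a", 0), ("group_a", 0), ("group_a", 0),
    ("group_b", 1), ("group_b", 1),
]
target_group = "group_b"  # assumed target group of the text


def majority_vote(pairs):
    """Return the most common rating across all annotators."""
    counts = Counter(rating for _, rating in pairs)
    return counts.most_common(1)[0][0]


def group_mean(pairs, group):
    """Return the mean rating of annotators in `group`, or None if absent."""
    values = [rating for g, rating in pairs if g == group]
    return mean(values) if values else None


def rating_variance(pairs):
    """Return the population variance of ratings across all annotators."""
    return pvariance([rating for _, rating in pairs])


majority_label = majority_vote(ratings)
target_opinion = group_mean(ratings, target_group)
disagreement = rating_variance(ratings)

print(majority_label, target_opinion, disagreement)
```

In this toy case the majority label is "not offensive" even though every annotator from the target group rated the text as offensive, and the nonzero variance flags the disagreement. The paper's approach predicts these per-annotator ratings with a model rather than reading them directly from observed labels as this sketch does.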