Though majority vote among annotators is typically used for ground truth labels in natural language processing, annotator disagreement in tasks such as hate speech detection may reflect differences in opinion across groups, not noise. Thus, a crucial problem in hate speech detection is determining whether a statement is offensive to the demographic group that it targets, when that group may constitute a small fraction of the annotator pool. We construct a model that predicts individual annotator ratings on potentially offensive text and combines this information with the predicted target group of the text to model the opinions of target group members. We show gains across a range of metrics, including raising performance over the baseline by 22% at predicting individual annotators' ratings and by 33% at predicting variance among annotators, which provides a metric for model uncertainty downstream. We find that annotator ratings can be predicted using their demographic information and opinions on online content, without the need to track identifying annotator IDs that link each annotator to their ratings. We also find that use of non-invasive survey questions on annotators' online experiences helps to maximize privacy and minimize unnecessary collection of demographic information when predicting annotators' opinions.
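As a rough illustration of the aggregation step described above, the following sketch assumes per-annotator rating predictions and a predicted target group are already available. It averages the predicted ratings of annotators who belong to the target group and computes the variance across all annotators. The function names, the dictionary layout, the group labels, and the rating scale are all hypothetical; the abstract does not specify how the authors' model combines these quantities, so this is not their implementation.

```python
from statistics import mean, pvariance


def target_group_opinion(predicted_ratings, annotator_groups, target_group):
    """Average predicted rating over annotators who belong to target_group.

    predicted_ratings: dict mapping annotator key -> predicted rating (hypothetical layout)
    annotator_groups:  dict mapping annotator key -> set of demographic group labels
    Returns None when no annotator in the pool belongs to the target group.
    """
    member_ratings = [
        rating
        for annotator, rating in predicted_ratings.items()
        if target_group in annotator_groups.get(annotator, set())
    ]
    return mean(member_ratings) if member_ratings else None


def annotator_variance(predicted_ratings):
    """Population variance of predicted ratings across all annotators."""
    return pvariance(list(predicted_ratings.values()))


# Hypothetical example: three annotators outside the target group rate the
# text as mildly offensive, while the single target-group member rates it high.
ratings = {"a1": 1.0, "a2": 1.0, "a3": 1.0, "a4": 4.0}
groups = {"a1": {"group_a"}, "a2": {"group_a"}, "a3": {"group_a"}, "a4": {"group_b"}}

overall = mean(ratings.values())                              # pool-wide average
in_group = target_group_opinion(ratings, groups, "group_b")   # target-group average
spread = annotator_variance(ratings)                          # disagreement signal
```

In this toy pool the overall average hides the target-group member's rating, which is the situation the abstract describes when the target group is a small fraction of annotators.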