When Does Demographic Information Help? Data and Modeling Regimes for Perspective-Aware Hate Speech Detection

Demographic information is often used to model annotator perspectives in subjective tasks such as hate speech detection, but its benefit is inconsistent: it improves performance in some settings and acts as noise in others. This paper asks when demographic features help. We analyze demographic gain as a function of both data split properties and modeling frameworks. For data splits, we measure annotator disagreement, namely how often annotators assign different labels to the same example, along with training size and train-test demographic coverage. We find that demographic gains concentrate in regimes with low training disagreement, high test disagreement, fine-grained ambiguity measurement, sufficient training data, and greater demographic overlap between training and test sets. Motivated by these regimes, we introduce a gated demographic residual model that treats demographics as a selective adjustment to text-only predictions. Experiments on MHS and POPQUORN show that this design is effective, especially on high-disagreement or low-confidence examples. Overall, our results suggest that demographics should not be assumed useful by default; their value depends jointly on the data regime and the modeling framework.
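
The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the exact disagreement statistic or the model architecture, so the pairwise disagreement rate, the sigmoid gate, and the names `pairwise_disagreement`, `gated_residual_logit`, `text_logit`, `demo_residual`, and `gate_logit` are all assumptions chosen for illustration.

```python
import math
from itertools import combinations


def pairwise_disagreement(labels):
    """Fraction of annotator pairs that assigned different labels to one example.

    Assumed statistic: the abstract only says disagreement is "how often
    annotators assign different labels to the same example".
    """
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 0.0  # fewer than two annotations: nothing to compare
    return sum(a != b for a, b in pairs) / len(pairs)


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_residual_logit(text_logit, demo_residual, gate_logit):
    """Text-only logit plus a demographic correction scaled by a gate in (0, 1).

    Hypothetical form: final = text_logit + sigmoid(gate_logit) * demo_residual.
    A gate near 0 falls back to the text-only prediction; a gate near 1
    applies the full demographic adjustment.
    """
    gate = sigmoid(gate_logit)
    return text_logit + gate * demo_residual


# Toy values, not taken from MHS or POPQUORN.
print(pairwise_disagreement([1, 1, 0, 1]))     # 3 of 6 pairs differ -> 0.5
print(gated_residual_logit(1.0, -2.0, 0.0))    # gate = 0.5 -> 0.0
print(gated_residual_logit(1.0, -2.0, -20.0))  # gate near 0, stays close to 1.0
```

In this sketch the gate is what makes the adjustment selective: when it closes, the demographic features cannot add noise to the text-only prediction. In a trained model the gate and residual would be learned functions of the input rather than fixed numbers.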

