Demographic information is often used to model annotator perspectives in subjective tasks such as hate speech detection, but its benefit is inconsistent: it improves performance in some settings and acts as noise in others. This paper asks when demographic features help. We analyze demographic gain as a function of both data-split properties and modeling frameworks. For data splits, we measure annotator disagreement (how often annotators assign different labels to the same example), training size, and train-test demographic coverage. We find that demographic gains concentrate in regimes with low training disagreement, high test disagreement, fine-grained ambiguity measurement, sufficient training data, and greater demographic overlap between training and test sets. Motivated by these regimes, we introduce a gated demographic residual model that treats demographics as a selective adjustment to text-only predictions. Experiments on MHS and POPQUORN show that this design is effective, especially on high-disagreement or low-confidence examples. Overall, our results suggest that demographics should not be assumed useful by default; their value depends jointly on the data regime and the modeling framework.
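To make two ideas from the abstract concrete, the following is a minimal sketch and not the paper's implementation. It assumes that per-example disagreement can be measured as the fraction of annotator pairs whose labels differ, and that the gated residual takes the form "text-only logit plus gate times demographic adjustment" with a sigmoid gate. The function names (`pairwise_disagreement`, `gated_residual_logit`) and the scalar inputs are hypothetical; the abstract does not specify the exact disagreement measure or the model's parameterization.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def pairwise_disagreement(labels: list) -> float:
    """Fraction of annotator pairs on one example whose labels differ.

    Assumption: this pairwise rate is one possible reading of
    "how often annotators assign different labels to the same example".
    """
    n = len(labels)
    if n < 2:
        return 0.0
    total_pairs = n * (n - 1) // 2
    differing = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if labels[i] != labels[j]
    )
    return differing / total_pairs


def gated_residual_logit(
    text_logit: float, demo_residual: float, gate_logit: float
) -> float:
    """Text-only prediction plus a gated demographic adjustment.

    Assumption: the gate is a sigmoid in (0, 1). When it is near 0 the
    output falls back to the text-only logit; when it is near 1 the full
    demographic residual is applied.
    """
    gate = sigmoid(gate_logit)
    return text_logit + gate * demo_residual


if __name__ == "__main__":
    # Three annotators, one dissenting label.
    print(pairwise_disagreement([1, 1, 0]))
    # Gate nearly closed: the prediction stays close to the text-only logit.
    print(gated_residual_logit(1.0, 2.0, -50.0))
    # Gate half open: half of the demographic residual is added.
    print(gated_residual_logit(1.0, 2.0, 0.0))
```

In a trained model the gate and residual would presumably be learned functions of the text and demographic features. They are passed in as plain numbers here only to isolate the "selective adjustment" arithmetic.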