Bias-measuring datasets play a critical role in detecting biased behavior in language models and in evaluating the progress of bias mitigation methods. In this work, we focus on evaluating gender bias through coreference resolution, where previous datasets are either hand-crafted or fail to reliably measure an explicitly defined bias. To overcome these shortcomings, we propose a novel method for collecting diverse, natural, and minimally distant text pairs via counterfactual generation, and we construct Counter-GAP, an annotated dataset of 4008 instances grouped into 1002 quadruples. We further identify a bias cancellation problem that affects previous group-level metrics on Counter-GAP, and propose measuring bias at the quadruple level as the difference between inconsistency across genders and inconsistency within genders. Our results show that four pre-trained language models are significantly more inconsistent across gender groups than within each group, and that a name-based counterfactual data augmentation method mitigates this bias more effectively than an anonymization-based method.
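The quadruple-level measure described above can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: each quadruple is taken to hold two male-name and two female-name variants of the same text, each model prediction is reduced to a boolean, inconsistency is the fraction of prediction pairs that disagree, and the function names (`cross_inconsistency`, `within_inconsistency`, `quadruple_bias`, `dataset_bias`) are hypothetical.

```python
from itertools import combinations, product
from statistics import mean


def cross_inconsistency(preds_a, preds_b):
    """Fraction of (a, b) pairs, one from each group, whose predictions differ."""
    return mean(1.0 if a != b else 0.0 for a, b in product(preds_a, preds_b))


def within_inconsistency(preds):
    """Fraction of unordered pairs inside one group whose predictions differ."""
    return mean(1.0 if a != b else 0.0 for a, b in combinations(preds, 2))


def quadruple_bias(male_preds, female_preds):
    """Across-gender inconsistency minus mean within-gender inconsistency.

    Assumed layout: two predictions per gender for one quadruple.
    """
    across = cross_inconsistency(male_preds, female_preds)
    within = mean([
        within_inconsistency(male_preds),
        within_inconsistency(female_preds),
    ])
    return across - within


def dataset_bias(quadruples):
    """Average the per-quadruple score over (male_preds, female_preds) pairs."""
    return mean(quadruple_bias(m, f) for m, f in quadruples)
```

Under these assumptions, a model whose prediction flips only when the gender changes scores 1.0 on that quadruple, while a model that is equally unstable within and across genders scores at or below zero. Because disagreement is counted per quadruple regardless of which gender is favored, opposite-direction errors in different quadruples do not offset each other the way they can in a group-level average.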