Measurement of social bias in language models typically relies on token probability (TP) metrics, which are broadly applicable but have been criticized for their distance from real-world language model use cases and harms. In this work, we test natural language inference (NLI) as an alternative bias metric. In extensive experiments across seven LM families, we show that NLI and TP bias evaluation behave substantially differently, with very low correlation among different NLI metrics and between NLI and TP metrics. NLI metrics are more brittle and unstable, slightly less sensitive to the wording of counterstereotypical sentences, and slightly more sensitive to the wording of tested stereotypes than TP approaches. Given this conflicting evidence, we conclude that neither token probability nor natural language inference is a ``better'' bias metric in all cases, and we do not find sufficient evidence to justify NLI as a complete replacement for TP metrics in bias evaluation.
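The abstract does not specify the exact TP metrics used; as a generic illustration only, a token-probability bias check typically compares a model's likelihood for a stereotypical versus a counterstereotypical sentence pair. A minimal sketch follows, in which `sentence_log_prob` is a hypothetical stand-in for a real LM scorer and the example sentences and scores are invented for illustration:

```python
# Generic sketch of a pairwise token-probability (TP) bias check.
# All names, sentences, and scores here are illustrative assumptions,
# not the paper's actual metric or data.

def sentence_log_prob(sentence: str) -> float:
    # Stand-in scorer: in practice this would sum a language model's
    # token log-probabilities for the sentence.
    toy_scores = {
        "The doctor said he was late.": -12.0,
        "The doctor said she was late.": -14.5,
    }
    return toy_scores[sentence]

def tp_bias_score(stereo: str, antistereo: str) -> float:
    """Positive when the model assigns higher likelihood to the
    stereotypical sentence than to its counterstereotypical pair."""
    return sentence_log_prob(stereo) - sentence_log_prob(antistereo)

score = tp_bias_score(
    "The doctor said he was late.",
    "The doctor said she was late.",
)
print(score)  # 2.5
```

Under this framing, a score near zero indicates no preference between the pair; the paper's point is that NLI-based metrics, which instead ask whether the model judges one sentence to entail or contradict another, can rank the same model very differently.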