The widespread use of large language models has raised essential questions about the potential biases these models might learn. This has led to the development of several metrics aimed at evaluating and mitigating these biases. In this paper, we first demonstrate that prompt-based fairness metrics exhibit poor agreement, as measured by correlation, casting doubt on the reliability of fairness assessment using prompts. We then outline six relevant reasons why such low correlation is observed across existing metrics. Based on these insights, we propose a method called Correlated Fairness Output (CAIRO) to enhance the correlation between fairness metrics. CAIRO augments the original prompts of a given fairness metric using several pre-trained language models, then selects the combination of augmented prompts that achieves the highest correlation across metrics. We show a significant improvement in Pearson correlation, from 0.30 and 0.18 to 0.90 and 0.98 across metrics, for gender and religion biases, respectively. Our code is available at https://github.com/chandar-lab/CAIRO.
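To make the selection step concrete, the sketch below illustrates one way the combination search could work; it is a minimal, hypothetical illustration, not the authors' implementation (see the linked repository for that). The data, the `select_best_combination` helper, and the two-metric setup are all assumptions: each fairness metric is assumed to have been scored on several candidate sets of augmented prompts across the same models, and the pair of prompt sets whose scores correlate best is kept.

```python
# Hypothetical sketch of CAIRO's selection step (not the official code).
# Assumption: each row holds one candidate augmented-prompt set's bias
# scores for the same five language models under test.
from itertools import product

import numpy as np
from scipy.stats import pearsonr

metric_a_scores = np.array([
    [0.42, 0.55, 0.31, 0.48, 0.60],  # prompt set A1
    [0.40, 0.52, 0.35, 0.50, 0.58],  # prompt set A2
])
metric_b_scores = np.array([
    [0.10, 0.30, 0.05, 0.22, 0.35],  # prompt set B1
    [0.33, 0.12, 0.40, 0.18, 0.09],  # prompt set B2
])


def select_best_combination(scores_a, scores_b):
    """Return the pair of prompt-set indices maximizing Pearson correlation."""
    best_r, best_pair = -1.0, None
    # Exhaustively try every pairing of candidate prompt sets.
    for i, j in product(range(len(scores_a)), range(len(scores_b))):
        r, _ = pearsonr(scores_a[i], scores_b[j])
        if r > best_r:
            best_r, best_pair = r, (i, j)
    return best_r, best_pair


r, (i, j) = select_best_combination(metric_a_scores, metric_b_scores)
print(f"Best cross-metric correlation {r:.2f} with prompt sets A{i + 1} and B{j + 1}")
```

With more than two metrics, the same idea would apply with an aggregate (e.g., the mean pairwise Pearson correlation) as the selection objective; the exhaustive search here is only tractable for small numbers of candidate prompt sets.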