The increasing use of Large Language Models (LLMs) in sensitive domains has led to growing interest in how their confidence scores relate to fairness and bias. This study examines the alignment between LLM-predicted confidence and human-annotated bias judgments. Focusing on gender bias, the research investigates the calibration of predicted probabilities in contexts involving gendered pronoun resolution. The goal is to evaluate whether calibration metrics based on predicted confidence scores effectively capture fairness-related disparities in LLMs. The results show that, among the six state-of-the-art models evaluated, Gemma-2 exhibits the worst calibration on the gender bias benchmark. The primary contribution of this work is a fairness-aware evaluation of LLMs' confidence calibration, offering guidance for ethical deployment. In addition, we introduce a new calibration metric, Gender-ECE, designed to measure gender disparities in pronoun resolution tasks.
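To make the proposed metric concrete, the sketch below shows one plausible way an ECE-style gender calibration gap could be computed. This is a minimal illustration only: the abstract does not define Gender-ECE, so the assumption that it equals the gap between per-gender-group Expected Calibration Errors, as well as the function names and the equal-width binning scheme, are hypothetical and may differ from the paper's actual formulation.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: weighted average gap between mean confidence and
    accuracy within equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each prediction to one of n_bins equal-width bins over [0, 1].
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        acc = correct[mask].mean()     # empirical accuracy in the bin
        conf = confidences[mask].mean()  # mean predicted confidence in the bin
        ece += (mask.sum() / len(confidences)) * abs(acc - conf)
    return ece

def gender_ece(confidences, correct, genders, n_bins=10):
    """Hypothetical Gender-ECE: the gap between ECEs computed separately on
    examples grouped by the gender of the pronoun being resolved
    (an assumed definition, not the paper's)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct)
    genders = np.asarray(genders)
    ece_by_group = {
        g: expected_calibration_error(confidences[genders == g],
                                      correct[genders == g], n_bins)
        for g in np.unique(genders)
    }
    gap = max(ece_by_group.values()) - min(ece_by_group.values())
    return gap, ece_by_group
```

Under this reading, a well-calibrated but gender-disparate model would show a large gap even if its overall ECE is small, which is the kind of fairness-related disparity the abstract argues aggregate calibration metrics can miss.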