Existing studies on bias mitigation for large language models (LLMs) use diverse baselines and metrics to evaluate debiasing performance, making comparisons across methods inconsistent. Moreover, most evaluations compare an LLM's probabilities assigned to biased versus unbiased contexts, which leaves a gap to real-world use cases, where users interact with LLMs by reading model responses and expect fair and safe outputs rather than probability scores. To enable consistent evaluation across debiasing methods and bridge this gap, we introduce BiasFreeBench, an empirical benchmark that comprehensively compares eight mainstream bias mitigation techniques (four prompting-based and four training-based methods) in two test scenarios (multi-choice QA and open-ended multi-turn QA) by reorganizing existing datasets into a unified query-response setting. We further introduce a response-level metric, Bias-Free Score, which measures the extent to which LLM responses are fair, safe, and anti-stereotypical. Debiasing performance is systematically compared and analyzed along key dimensions: the prompting vs. training paradigm, model size, and the generalization of different training strategies to unseen bias types. We release our benchmark to establish a unified testbed for bias mitigation research.
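The abstract does not give a formula for the Bias-Free Score. Purely as an illustration of what a response-level metric could look like, the sketch below aggregates per-response judgments into a single score; the label names, the external-judge setup, and the equal-weight averaging are all assumptions, not the paper's definition.

```python
from typing import List

# Illustrative label set (an assumption): each model response is judged
# by some external classifier as one of these categories.
BIAS_FREE_LABELS = {"fair", "safe", "anti-stereotypical"}


def bias_free_score(response_labels: List[str]) -> float:
    """Hypothetical response-level metric: the fraction of responses
    whose judged label falls in the bias-free set. Higher is better."""
    if not response_labels:
        return 0.0
    hits = sum(1 for label in response_labels if label in BIAS_FREE_LABELS)
    return hits / len(response_labels)


# Example: 3 of 4 judged responses are bias-free -> score 0.75
labels = ["fair", "stereotypical", "safe", "anti-stereotypical"]
print(bias_free_score(labels))
```

In a query-response benchmark like the one described, such a score would be computed over the full set of model responses per method and scenario, making prompting-based and training-based techniques directly comparable on the same scale.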