As large language models (LLMs) are increasingly deployed in real-world applications, ensuring that they respond fairly across demographic groups has become crucial. Despite considerable effort, an ongoing challenge is hidden bias: LLMs appear fair under standard evaluations but can produce biased responses outside those evaluation settings. In this paper, we identify framing -- differences in how semantically equivalent prompts are expressed (e.g., "A is better than B" vs. "B is worse than A") -- as an underexplored contributor to this gap. We first introduce the concept of "framing disparity" to quantify the impact of framing on fairness evaluation. By augmenting fairness evaluation benchmarks with alternative framings, we find that (1) fairness scores vary significantly with framing, and (2) existing debiasing methods improve overall (i.e., frame-averaged) fairness but often fail to reduce framing-induced disparities. To address this, we propose a framing-aware debiasing method that encourages LLMs to respond more consistently across framings. Experiments demonstrate that our approach reduces overall bias and improves robustness to framing disparities, enabling LLMs to produce fairer and more consistent responses.
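As a minimal illustrative sketch (the abstract does not define the metric, and the notation below is our assumption rather than the paper's), framing disparity for a prompt $x$ could be measured as the largest gap in fairness scores across a set $\mathcal{F}$ of semantically equivalent framings:

\[
\mathrm{FD}(x) \;=\; \max_{f,\, f' \in \mathcal{F}} \bigl| S(f(x)) - S(f'(x)) \bigr|,
\]

where $S(\cdot)$ denotes a fairness score computed on the model's response to a framed prompt. Under this reading, the frame-averaged fairness that existing debiasing methods improve would correspond to $\frac{1}{|\mathcal{F}|}\sum_{f \in \mathcal{F}} S(f(x))$, which can remain high even when $\mathrm{FD}(x)$ is large; this is why a method can look fair on average while still being sensitive to framing.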