Large language models (LLMs) are increasingly deployed across high-impact domains, from clinical decision support and legal analysis to hiring and education, making fairness and bias evaluation before deployment critical. However, existing evaluations lack grounding in real-world scenarios and do not account for differences in harm severity; for example, a biased decision in surgery should not be weighed the same as a stylistic bias in text summarization. To address this gap, we introduce HALF (Harm-Aware LLM Fairness), a deployment-aligned framework that assesses model bias in realistic applications and weighs outcomes by harm severity. HALF organizes nine application domains into three tiers (Severe, Moderate, Mild) using a five-stage pipeline. Our evaluation of eight LLMs shows that (1) LLMs are not consistently fair across domains, (2) neither model size nor overall performance guarantees fairness, and (3) reasoning models perform better in medical decision support but worse in education. We conclude that HALF exposes a clear gap between prior benchmarking success and deployment readiness.