Bias and fairness risks in large language models (LLMs) vary substantially across deployment contexts, yet existing approaches offer no systematic guidance for selecting appropriate evaluation metrics. We present a decision framework that maps LLM use cases, each characterized by a model and a population of prompts, to relevant bias and fairness metrics based on task type, whether prompts mention protected attributes, and stakeholder priorities. Our framework addresses toxicity, stereotyping, counterfactual unfairness, and allocational harms, and introduces novel metrics based on stereotype classifiers and counterfactual adaptations of text similarity measures. We release an open-source Python library, \texttt{langfair}, for practical adoption. Extensive experiments on use cases spanning five LLMs and five prompt populations demonstrate that fairness risks cannot be reliably assessed from benchmark performance alone: results on one prompt dataset are likely to overstate or understate the risks for another, underscoring that fairness evaluation must be grounded in the specific deployment context.