The growing integration of large language models (LLMs) into social operations amplifies their impact on decisions in crucial areas such as economics, law, education, and healthcare, raising public concerns about the safety and reliability of these models with respect to discrimination. However, prior discrimination-measuring frameworks assess only the average discriminatory behavior of LLMs, which often proves inadequate because it overlooks an additional source of discrimination: the variation of LLMs' predictions across diverse contexts. In this work, we present the Prejudice-Caprice Framework (PCF), which comprehensively measures discrimination in LLMs by considering both their consistently biased preferences and the variation of those preferences across diverse contexts. Specifically, we mathematically decompose the aggregated contextualized discrimination risk of LLMs into prejudice risk, which originates from LLMs' persistent prejudice, and caprice risk, which stems from their generation inconsistency. In addition, we use a data-mining approach to gather preference-detecting probes from sentence skeletons that are devoid of attribute indications, in order to approximate the contexts in which LLMs are applied. Although initially intended for assessing discrimination in LLMs, the proposed PCF supports comprehensive and flexible measurement of any inductive bias, including knowledge as well as prejudice, across models of various modalities. We apply our discrimination-measuring framework to 12 common LLMs and obtain the following findings: i) modern LLMs demonstrate significant pro-male stereotypes, ii) the discrimination exhibited by LLMs correlates with several social and economic factors, iii) prejudice risk dominates the overall discrimination risk and follows a normal distribution, and iv) caprice risk contributes minimally to the overall risk but follows a fat-tailed distribution, suggesting that it is a wild risk requiring enhanced surveillance.
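The split into a persistent component and a context-variation component can be illustrated with a minimal sketch. It assumes a squared-deviation risk criterion and a neutral preference of 0.5, neither of which is taken from the abstract; the function name `decompose_risk` and the sample values are hypothetical, and the paper's actual risk definitions may differ. Under these assumptions, the mean squared deviation from neutrality across contexts equals the squared deviation of the mean preference (a prejudice-like term) plus the variance of the preference across contexts (a caprice-like term).

```python
from statistics import fmean


def decompose_risk(preferences, neutral=0.5):
    """Split mean squared deviation from a neutral preference into two parts.

    preferences: per-context probabilities that a model prefers one attribute
    value over another for the same target (hypothetical inputs).
    neutral: the preference regarded as unbiased (assumed to be 0.5 here).
    """
    mean_pref = fmean(preferences)
    # Overall term: average squared deviation from neutrality across contexts.
    overall = fmean([(p - neutral) ** 2 for p in preferences])
    # Prejudice-like term: squared deviation of the average preference.
    prejudice = (mean_pref - neutral) ** 2
    # Caprice-like term: population variance of the preference across contexts.
    caprice = fmean([(p - mean_pref) ** 2 for p in preferences])
    return overall, prejudice, caprice


# Illustrative values only, not measurements from any model.
overall, prejudice, caprice = decompose_risk([0.9, 0.7, 0.8, 0.6])
print(overall, prejudice, caprice)
```

With this criterion the two terms add up to the overall value exactly, so a model can be ranked separately on how biased it is on average and on how erratically its preference moves between contexts.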