The growing integration of large language models (LLMs) into social operations amplifies their impact on decisions in crucial areas such as economics, law, education, and healthcare, raising public concerns about these models' discrimination-related safety and reliability. However, prior discrimination-measuring frameworks assess only the average discriminatory behavior of LLMs, which often proves inadequate because it overlooks an additional factor that leads to discrimination: the variation of LLMs' predictions across diverse contexts. In this work, we present the Prejudice-Caprice Framework (PCF), which comprehensively measures discrimination in LLMs by considering both their consistently biased preference and the variation of their preference across diverse contexts. Specifically, we mathematically dissect the aggregated contextualized discrimination risk of LLMs into prejudice risk, originating from LLMs' persistent prejudice, and caprice risk, stemming from their generation inconsistency. In addition, we use a data-mining approach to gather preference-detecting probes from sentence skeletons, devoid of attribute indications, to approximate the contexts in which LLMs are applied. Although initially intended for assessing discrimination in LLMs, the proposed PCF supports comprehensive and flexible measurement of any inductive bias, including knowledge as well as prejudice, across models of various modalities. We apply our discrimination-measuring framework to 12 common LLMs, yielding intriguing findings: i) modern LLMs demonstrate significant pro-male stereotypes, ii) the discrimination exhibited by LLMs correlates with several social and economic factors, iii) prejudice risk dominates the overall discrimination risk and follows a normal distribution, and iv) caprice risk contributes minimally to the overall risk but follows a fat-tailed distribution, suggesting that it is a wild risk requiring enhanced surveillance.
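To make the prejudice/caprice split concrete, the following is a minimal sketch of one way such a decomposition could look. It is an illustration under stated assumptions, not the formulation used in this work: the abstract does not give the formulas, so the criterion `criterion` (squared distance from gender parity), the function `decompose_risk`, and the toy preference lists are all hypothetical. The sketch assumes that overall risk is the average per-context risk, that prejudice risk is the risk of the context-averaged preference, and that caprice risk is the remainder.

```python
from statistics import fmean


def criterion(p_male: float) -> float:
    """Hypothetical discrimination criterion J (an assumption, not from the paper).

    p_male is the model's preference for the male attribute in one context.
    Returns 0 at parity (p_male = 0.5) and 1 at full one-sided preference.
    """
    return (2.0 * p_male - 1.0) ** 2


def decompose_risk(prefs: list[float]) -> tuple[float, float, float]:
    """Split the average per-context risk into two parts (assumed form).

    overall   = mean over contexts of J(p_c)
    prejudice = J(mean over contexts of p_c), the persistent part
    caprice   = overall - prejudice, the part caused by variation across contexts
    Because this J is convex, Jensen's inequality keeps caprice non-negative.
    """
    overall = fmean(criterion(p) for p in prefs)
    prejudice = criterion(fmean(prefs))
    caprice = overall - prejudice
    return overall, prejudice, caprice


# Two toy "models" with the same average preference (0.70) for the male attribute.
steady = [0.70, 0.70, 0.70, 0.70]   # same preference in every context
erratic = [0.95, 0.45, 0.90, 0.50]  # preference swings between contexts

for name, prefs in (("steady", steady), ("erratic", erratic)):
    overall, prejudice, caprice = decompose_risk(prefs)
    print(f"{name}: overall={overall:.3f} prejudice={prejudice:.3f} caprice={caprice:.3f}")
```

Both toy models share the same prejudice term because their average preference is identical, while only the erratic one accumulates a caprice term. A framework that reports average behavior alone would not distinguish the two. With this particular quadratic criterion, the caprice term equals four times the variance of the preference across contexts.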