The growing integration of large language models (LLMs) into social operations amplifies their impact on decisions in crucial areas such as economics, law, education, and healthcare, raising public concerns about these models' discrimination-related safety and reliability. However, prior discrimination-measuring frameworks assess only the average discriminatory behavior of LLMs, which often proves inadequate because it overlooks an additional source of discrimination: the variation in LLMs' predictions across diverse contexts. In this work, we present the Prejudice-Caprice Framework (PCF), which comprehensively measures discrimination in LLMs by considering both their consistently biased preferences and the variation of those preferences across diverse contexts. Specifically, we mathematically decompose the aggregated contextualized discrimination risk of LLMs into prejudice risk, originating from LLMs' persistent prejudice, and caprice risk, stemming from their generation inconsistency. In addition, we use a data-mining approach to gather preference-detecting probes from sentence skeletons, devoid of attribute indications, to approximate the contexts in which LLMs are applied. Although initially intended for assessing discrimination in LLMs, PCF facilitates the comprehensive and flexible measurement of any inductive bias, including knowledge as well as prejudice, across models of various modalities. We apply our discrimination-measuring framework to 12 common LLMs, yielding intriguing findings: i) modern LLMs demonstrate significant pro-male stereotypes, ii) the discrimination LLMs exhibit correlates with several social and economic factors, iii) prejudice risk dominates the overall discrimination risk and follows a normal distribution, and iv) caprice risk contributes minimally to the overall risk but follows a fat-tailed distribution, suggesting that it is a wild risk requiring enhanced surveillance.
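To make the prejudice/caprice split concrete, the following is a minimal sketch of one way such a decomposition could look, not the paper's exact definitions. It assumes a binary attribute (for example, male versus female), a per-context preference probability for one of the two values, and a squared deviation from the neutral value 0.5 as the discrimination criterion. The function name `decompose_risk`, the `neutral` parameter, and the choice of criterion are all hypothetical, introduced only for illustration. Under these assumptions the overall risk separates into a term driven by the mean preference (prejudice) and a term driven by its variance across contexts (caprice).

```python
from statistics import fmean


def decompose_risk(prefs, neutral=0.5):
    """Split the mean squared deviation from `neutral` into two parts.

    prefs   -- per-context probabilities that the model prefers one
               attribute value (hypothetical input format)
    neutral -- the preference an unbiased model would show (assumed 0.5)

    Returns (overall, prejudice, caprice) with
    overall == prejudice + caprice up to floating-point error.
    """
    if not prefs:
        raise ValueError("prefs must be non-empty")

    mean_pref = fmean(prefs)

    # Overall risk: the criterion averaged over all contexts.
    overall = fmean((p - neutral) ** 2 for p in prefs)

    # Prejudice term: the criterion applied to the average preference,
    # i.e. the bias that persists regardless of context.
    prejudice = (mean_pref - neutral) ** 2

    # Caprice term: the variance of the preference across contexts,
    # i.e. the part of the risk caused by inconsistency.
    caprice = fmean((p - mean_pref) ** 2 for p in prefs)

    return overall, prejudice, caprice
```

For instance, with the illustrative preferences `[0.9, 0.7, 0.8, 0.6]`, the mean preference is 0.75, so the sketch yields an overall risk of about 0.075, of which about 0.0625 is the prejudice term and about 0.0125 is the caprice term. A model whose preferences average exactly 0.5 but swing widely between contexts would show zero prejudice and nonzero caprice, which is the case that average-only measures miss.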