This study investigates why and how inconsistency in the outputs generated by Large Language Models (LLMs) may induce or exacerbate societal injustice. For instance, LLMs frequently exhibit contrasting gender stereotypes about the same career depending on the context, which highlights the arguably harmful unpredictability of their behavior. To extend existing discrimination assessments so that they account for variation in LLM generation, we formulate the Prejudice-Volatility Framework (PVF), which precisely defines behavioral metrics for assessing LLMs; these metrics characterize the probability distribution of an LLM's stereotypes from the perspective of token prediction probabilities. Specifically, we employ a data-mining approach to approximate the contexts in which LLMs may be applied and devise statistical metrics to evaluate the corresponding contextualized societal discrimination risk. Further, we mathematically decompose the aggregated discrimination risk of LLMs into prejudice risk, which originates from their systematic bias, and volatility risk, which stems from their generation inconsistency. Although initially intended for assessing discrimination in LLMs, PVF supports comprehensive and flexible measurement of any inductive bias, including knowledge as well as prejudice, across models of various modalities. We apply PVF to 12 of the most commonly adopted LLMs and compare their risk levels. Our findings reveal that: i) prejudice risk is the primary cause of discrimination risk in LLMs, indicating that inherent biases in these models lead to stereotypical outputs; ii) most LLMs exhibit significant pro-male stereotypes across nearly all careers; iii) alignment with Reinforcement Learning from Human Feedback lowers discrimination by reducing prejudice, but increases volatility; iv) discrimination risk in LLMs correlates with socio-economic factors such as profession salaries.
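The split of aggregated discrimination risk into prejudice risk and volatility risk can be illustrated with a minimal sketch. The abstract does not give the exact PVF definitions, so the example makes its own assumptions: each context yields a hypothetical stereotype score s = p(male | context) - p(female | context) in [-1, 1], and risk is measured by a squared criterion. Under those assumptions the mean squared score separates into the square of the mean score (a stand-in for prejudice) plus the variance across contexts (a stand-in for volatility). The function name `decompose_risk` and the scoring rule are illustrative and are not taken from the paper.

```python
from statistics import fmean


def decompose_risk(scores):
    """Split a mean squared stereotype score into two parts.

    ASSUMPTION: `scores` holds one hypothetical per-context score
    s = p(male | context) - p(female | context), and risk is measured
    with a squared criterion. The paper's actual criterion may differ.
    """
    scores = list(scores)
    mean_score = fmean(scores)

    # Overall risk: average squared score across all sampled contexts.
    overall = fmean([s * s for s in scores])

    # Prejudice-like part: risk of the average (systematic) score.
    prejudice = mean_score ** 2

    # Volatility-like part: what remains, i.e. the population variance
    # of the score across contexts.
    volatility = overall - prejudice
    return overall, prejudice, volatility


# A model that is consistently skewed versus one that flips with context.
consistent = decompose_risk([0.5, 0.5, 0.5])
flipping = decompose_risk([0.5, -0.5])
```

In this toy setting both models carry the same overall risk, but for the first it is entirely prejudice-like and for the second entirely volatility-like. That is the distinction a measure based only on the mean score would miss.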