As large language models (LLMs) are adopted into frameworks that grant them the capacity to make real decisions, it is increasingly important to ensure that they are unbiased. In this paper, we argue that the predominant approach of simply removing existing biases from models is not enough. Using a paradigm from the psychology literature, we demonstrate that LLMs can spontaneously develop novel social biases about artificial demographic groups even when no inherent differences exist. These biases result in highly stratified task allocations, which are less fair than assignments by human participants and are exacerbated by newer and larger models. In social science, emergent biases like these have been shown to result from exploration-exploitation trade-offs, where the decision-maker explores too little, allowing early observations to strongly influence impressions about entire demographic groups. To alleviate this effect, we examine a series of interventions targeting model inputs, problem structure, and explicit steering. We find that explicitly incentivizing exploration most robustly reduces stratification, highlighting the need for better multifaceted objectives to mitigate bias. These results reveal that LLMs are not merely passive mirrors of human social biases, but can actively create new ones from experience, raising urgent questions about how these systems will shape societies over time.
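The under-exploration mechanism described above can be made concrete with a small, purely illustrative simulation (this is not the paper's experimental setup; the group names, success rate, and exploration rate below are hypothetical). A near-greedy allocator assigns tasks to two artificial groups with identical underlying success rates, yet early chance outcomes lock in a stratified allocation:

```python
import random

# Illustrative sketch, assuming a simple bandit-style allocator:
# two artificial groups share the SAME true success rate, so any
# stratification that emerges comes from under-exploration alone.
random.seed(0)
TRUE_SUCCESS_RATE = 0.6              # identical for both groups
EPSILON = 0.02                       # near-zero exploration: almost always greedy
groups = ["Group A", "Group B"]
successes = {g: 0 for g in groups}
trials = {g: 0 for g in groups}
assignments = {g: 0 for g in groups}

def estimated_rate(g):
    # Neutral prior of 0.5 until a group has been observed at least once.
    return successes[g] / trials[g] if trials[g] else 0.5

for step in range(500):
    if random.random() < EPSILON:
        choice = random.choice(groups)            # rare exploratory assignment
    else:
        choice = max(groups, key=estimated_rate)  # exploit the current impression
    assignments[choice] += 1
    trials[choice] += 1
    if random.random() < TRUE_SUCCESS_RATE:       # outcome is group-independent
        successes[choice] += 1

print(assignments)  # typically highly unequal despite identical groups
```

Raising `EPSILON` (i.e., explicitly incentivizing exploration) evens out the allocation, which mirrors the intervention the abstract reports as most robust.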