Applications based on large language models (LLMs), such as multi-agent simulations, require diversity across the agent population. We identify a pervasive failure mode we term \emph{Persona Collapse}: agents that are each assigned a distinct profile nonetheless converge on a narrow behavioral mode, producing a homogeneous simulated population. To quantify persona collapse, we propose a framework that measures how much of the persona space a population occupies (Coverage), how evenly agents spread across it (Uniformity), and how rich the resulting behavioral patterns are (Complexity). Evaluating ten LLMs on personality simulation (BFI-44), moral reasoning, and self-introduction, we observe that persona collapse varies along two axes: (1) Dimensions: a model can appear diverse on one of these three dimensions yet be structurally degenerate on another, and (2) Domains: the same model may be the most collapsed in personality simulation yet the most diverse in moral reasoning. Furthermore, item-level diagnostics reveal that behavioral variation tracks coarse demographic stereotypes rather than the fine-grained individual differences specified in each persona. Counter-intuitively, \textbf{the models achieving the highest per-persona fidelity consistently produce the most stereotyped populations}. We release our toolkit and data to support population-level evaluation of LLMs.
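As a rough illustration of what the three population-level quantities could look like in practice, the sketch below computes simple stand-ins for Coverage, Uniformity, and Complexity over a population of score vectors. It is not the framework's actual definition, which the abstract does not specify. The grid discretization, the entropy normalization, the correlation-based complexity proxy, and all function names are assumptions made purely for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cell_counts(population, bins=4):
    """Count agents per grid cell; every score is assumed to lie in [0, 1]."""
    def cell(vec):
        return tuple(min(int(x * bins), bins - 1) for x in vec)
    return Counter(cell(vec) for vec in population)


def coverage(population, bins=4):
    """Fraction of grid cells occupied by at least one agent (illustrative proxy)."""
    dims = len(population[0])
    return len(cell_counts(population, bins)) / bins ** dims


def uniformity(population, bins=4):
    """Shannon entropy of cell occupancy, normalized over the occupied cells."""
    counts = cell_counts(population, bins)
    if len(counts) < 2:
        return 0.0  # a single occupied cell has no spread to measure
    n = len(population)
    entropy = -sum(c / n * math.log(c / n) for c in counts.values())
    return entropy / math.log(len(counts))


def complexity(population):
    """1 minus the mean absolute Pearson correlation between score dimensions."""
    columns = list(zip(*population))

    def corr(a, b):
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        va = sum((x - ma) ** 2 for x in a)
        vb = sum((y - mb) ** 2 for y in b)
        if va == 0 or vb == 0:
            return 1.0  # assumption: a constant dimension adds no independent variation
        return cov / math.sqrt(va * vb)

    pairs = list(combinations(columns, 2))
    return 1.0 - sum(abs(corr(a, b)) for a, b in pairs) / len(pairs)


# Hypothetical two-trait populations: one collapsed, one spread out.
collapsed = [(0.50, 0.50), (0.52, 0.51), (0.55, 0.50), (0.51, 0.53)]
diverse = [(0.1, 0.1), (0.1, 0.9), (0.9, 0.1), (0.9, 0.9)]

for name, pop in [("collapsed", collapsed), ("diverse", diverse)]:
    print(name, coverage(pop), uniformity(pop), complexity(pop))
```

Under these assumed definitions, the collapsed population falls into a single grid cell, so it scores low on both Coverage and Uniformity. Scoring the three quantities separately is what would let a population look acceptable on one dimension while degenerating on another.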