Large language models (LLMs) are increasingly deployed to simulate human collective behaviors, yet the methodological rigor of these "AI societies" remains under-explored. Through a systematic audit of 42 recent studies, we identify six pervasive flaws spanning agent profiles, interaction, memory, control, unawareness, and realism (PIMMUR). Our analysis reveals that 90.7% of studies violate at least one principle, undermining simulation validity. We demonstrate that frontier LLMs correctly identify the underlying social experiment in 47.6% of cases, while 65.3% of prompts exert excessive control that predetermines outcomes. By reproducing five representative experiments (e.g., the telephone game), we show that reported collective phenomena often vanish or reverse when the PIMMUR principles are enforced, suggesting that many "emergent" behaviors are methodological artifacts rather than genuine social dynamics. Our findings indicate that current AI simulations may capture model-specific biases rather than universal human social behaviors, raising critical concerns about the use of LLMs as scientific proxies for human society.