Large language models (LLMs) provide a compelling foundation for building generally capable AI agents. These agents may soon be deployed at scale in the real world, representing the interests of individual humans (e.g., AI assistants) or groups of humans (e.g., AI-accelerated corporations). At present, relatively little is known about the dynamics of multiple LLM agents interacting over many generations of iterative deployment. In this paper, we examine whether a "society" of LLM agents can learn mutually beneficial social norms in the face of incentives to defect, a distinctive feature of human sociality that is arguably crucial to the success of civilization. In particular, we study the evolution of indirect reciprocity across generations of LLM agents playing a classic iterated Donor Game in which agents can observe the recent behavior of their peers. We find that the evolution of cooperation differs markedly across base models, with societies of Claude 3.5 Sonnet agents achieving significantly higher average scores than Gemini 1.5 Flash, which, in turn, outperforms GPT-4o. Further, Claude 3.5 Sonnet can make use of an additional mechanism for costly punishment to achieve yet higher scores, while Gemini 1.5 Flash and GPT-4o fail to do so. For each model class, we also observe variation in emergent behavior across random seeds, suggesting an understudied sensitive dependence on initial conditions. We suggest that our evaluation regime could inspire an inexpensive and informative new class of LLM benchmarks, focused on the implications of LLM agent deployment for the cooperative infrastructure of society.
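To make the setup concrete, the following is a minimal sketch of one generation of the iterated Donor Game with reputation-based (indirect-reciprocity) decisions. The payoff parameters, population size, and the simple rule-based donation strategy are illustrative assumptions, not the paper's exact protocol; in the setting studied here, each donation decision is made by an LLM agent conditioned on the recipient's recent behavioral trace, and high-scoring agents' strategies seed the next generation.

```python
import random

# Minimal sketch of one generation of the iterated Donor Game with
# indirect reciprocity. COST, BENEFIT, the population size, and the
# simple reputation rule below are illustrative assumptions, not the
# paper's exact protocol.

COST, BENEFIT = 1.0, 2.0       # donating costs the donor 1, gives the recipient 2
N_AGENTS, N_ROUNDS = 12, 200

scores = [0.0] * N_AGENTS
history = [[] for _ in range(N_AGENTS)]  # each agent's publicly observable donation record

def willing_to_donate(recipient: int) -> bool:
    """Toy discriminator strategy: donate if the recipient's recent
    record looks cooperative (or if there is no record yet).
    In the paper's setting, an LLM makes this decision from the trace."""
    trace = history[recipient][-3:]
    return not trace or sum(trace) / len(trace) >= 0.5

for _ in range(N_ROUNDS):
    donor, recipient = random.sample(range(N_AGENTS), 2)
    donated = willing_to_donate(recipient)
    if donated:
        scores[donor] -= COST         # cooperation is individually costly...
        scores[recipient] += BENEFIT  # ...but positive-sum, since BENEFIT > COST
    history[donor].append(1 if donated else 0)

# Average score rises only if cooperation survives the temptation to free-ride.
print(f"mean score after one generation: {sum(scores) / N_AGENTS:.2f}")
```

A costly-punishment variant would add a third option in which the donor pays a fee to deduct points from the recipient, corresponding to the additional mechanism discussed above.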