Cooperation and Exploitation in LLM Policy Synthesis for Sequential Social Dilemmas

We study LLM policy synthesis: using a large language model to iteratively generate programmatic agent policies for multi-agent environments. Rather than training neural policies via reinforcement learning, our framework prompts an LLM to produce Python policy functions, evaluates them in self-play, and refines them across iterations using performance feedback. We investigate feedback engineering, the design of what evaluation information is shown to the LLM during refinement, by comparing sparse feedback (scalar reward only) against dense feedback (reward plus four social metrics: efficiency, equality, sustainability, and peace). Across two canonical Sequential Social Dilemmas (Gathering and Cleanup) and two frontier LLMs (Claude Sonnet 4.6 and Gemini 3.1 Pro), dense feedback consistently matches or exceeds sparse feedback on all metrics. The advantage is largest in the Cleanup public goods game, where the social metrics help the LLM calibrate the tradeoff between costly cleaning and harvesting. Rather than triggering over-optimization of fairness, social metrics serve as a coordination signal that guides the LLM toward more effective cooperative strategies, including territory partitioning, adaptive role assignment, and the avoidance of wasteful aggression. We further run an adversarial experiment to determine whether LLMs can reward-hack these environments. We characterize five attack classes and discuss mitigations, highlighting an inherent tension in LLM policy synthesis between expressiveness and safety. Code is available at https://github.com/vicgalle/llm-policies-social-dilemmas.
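
The generate, evaluate, refine loop and the two feedback modes described above can be sketched as follows. This is a minimal illustration rather than the code in the linked repository: the names `EvalResult`, `build_feedback`, and `synthesize` are hypothetical, the feedback text format is an assumption, and the LLM call and the self-play evaluator are replaced by stand-in stubs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class EvalResult:
    """Outcome of one self-play evaluation (hypothetical container)."""
    reward: float
    social: Dict[str, float]  # efficiency, equality, sustainability, peace


def build_feedback(result: EvalResult, mode: str) -> str:
    """Format the evaluation information shown to the LLM on the next iteration."""
    lines = [f"reward: {result.reward:.2f}"]
    if mode == "dense":
        # Dense feedback appends the social metrics to the scalar reward.
        lines += [f"{name}: {value:.2f}" for name, value in result.social.items()]
    elif mode != "sparse":
        raise ValueError(f"unknown feedback mode: {mode!r}")
    return "\n".join(lines)


def synthesize(
    generate: Callable[[Optional[str], Optional[str]], str],
    evaluate: Callable[[str], EvalResult],
    mode: str,
    iterations: int,
) -> Tuple[Optional[str], List[str]]:
    """Iteratively refine policy source code using sparse or dense feedback."""
    source: Optional[str] = None
    feedback: Optional[str] = None
    history: List[str] = []
    for _ in range(iterations):
        source = generate(source, feedback)       # LLM proposes or refines a policy
        result = evaluate(source)                 # policy is scored in self-play
        feedback = build_feedback(result, mode)   # feedback for the next prompt
        history.append(feedback)
    return source, history


# Stand-in stubs so the sketch runs without an LLM or an environment.
def stub_generate(prev_source: Optional[str], feedback: Optional[str]) -> str:
    return "v0" if prev_source is None else prev_source + "+"


def stub_evaluate(source: str) -> EvalResult:
    metrics = {"efficiency": 1.0, "equality": 0.5, "sustainability": 0.25, "peace": 0.75}
    return EvalResult(reward=float(len(source)), social=metrics)
```

Calling `synthesize(stub_generate, stub_evaluate, "dense", 3)` runs three refinement rounds and returns the final policy source together with the feedback strings shown at each round. Switching `mode` to `"sparse"` changes only what `build_feedback` exposes, which is the single variable the sparse versus dense comparison manipulates.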
