There is growing interest in using Large Language Models (LLMs) as agents to tackle real-world tasks that may require assessing complex situations. Yet we have a limited understanding of LLMs' reasoning and decision-making capabilities, partly because dedicated evaluation benchmarks are lacking. As negotiating and compromising are key aspects of everyday communication and collaboration, we propose scorable negotiation games as a new evaluation framework for LLMs. We create a testbed of diverse text-based, multi-agent, multi-issue, semantically rich negotiation games with easily tunable difficulty. To solve the challenge, agents need strong arithmetic, inference, exploration, and planning capabilities, and must integrate them seamlessly. Using systematic zero-shot Chain-of-Thought (CoT) prompting, we show that agents can negotiate and consistently reach successful deals. We quantify performance with multiple metrics and observe a large gap between GPT-4 and earlier models. Importantly, we test generalization to new games and setups. Finally, we show that these games can help evaluate other critical aspects, such as the interaction dynamics between agents in the presence of greedy and adversarial players.
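To make "scorable" and "tunable difficulty" concrete, the following is a minimal sketch of one way such a game could be scored. It is an illustration under stated assumptions, not the testbed's actual implementation: the agent names, issues, point values, acceptance thresholds, and the rule that every agent must accept are all hypothetical. Each agent holds a private score table over the options of each issue, a deal picks one option per issue, and an agent accepts a deal when its total score meets its threshold.

```python
from itertools import product

# Hypothetical score tables: agent -> issue -> option -> points.
# All names and values are illustrative assumptions, not data from the paper.
SCORES = {
    "A": {"budget": {"low": 10, "high": 40}, "site": {"north": 30, "south": 5}},
    "B": {"budget": {"low": 35, "high": 10}, "site": {"north": 20, "south": 25}},
}

# Hypothetical minimum total score each agent requires to accept a deal.
THRESHOLDS = {"A": 40, "B": 30}


def score(agent, deal):
    """Sum the agent's points for the option chosen on each issue."""
    return sum(SCORES[agent][issue][option] for issue, option in deal.items())


def accepts(agent, deal, thresholds):
    """An agent accepts a deal if its score meets its own threshold."""
    return score(agent, deal) >= thresholds[agent]


def feasible_deals(thresholds):
    """Enumerate every deal and keep those that all agents accept."""
    first_agent = next(iter(SCORES))
    issues = sorted(SCORES[first_agent])  # assumes all agents share the same issues
    options = [sorted(SCORES[first_agent][issue]) for issue in issues]
    deals = (dict(zip(issues, combo)) for combo in product(*options))
    return [d for d in deals if all(accepts(a, d, thresholds) for a in SCORES)]


if __name__ == "__main__":
    print(len(feasible_deals(THRESHOLDS)))          # baseline thresholds
    print(len(feasible_deals({"A": 45, "B": 35})))  # stricter thresholds
```

In this sketch, raising the thresholds shrinks the set of deals that every agent accepts, which is one plausible way difficulty could be tuned. It also shows why arithmetic matters: an agent has to total its own scores correctly before it can judge whether a proposal clears its threshold.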