There is a growing interest in using Large Language Models (LLMs) in multi-agent systems to tackle interactive real-world tasks that require effective collaboration and the assessment of complex situations. Yet we still have a limited understanding of LLMs' communication and decision-making abilities in multi-agent setups. Negotiation is a fundamental task that spans many key features of communication, such as cooperation, competition, and the potential for manipulation. We therefore propose using scorable negotiation to evaluate LLMs. We create a testbed of complex multi-agent, multi-issue, and semantically rich negotiation games. To reach an agreement, agents must have strong arithmetic, inference, exploration, and planning capabilities and must integrate them in a dynamic, multi-turn setup. We propose multiple metrics to rigorously quantify agents' performance and alignment with their assigned roles. We provide procedures for creating new games and increasing their difficulty, so that the benchmark can evolve. Importantly, we evaluate critical safety aspects, such as how the interaction dynamics between agents are influenced by greedy and adversarial players. Our benchmark is highly challenging: GPT-3.5 and small models mostly fail, and GPT-4 and SoTA large models (e.g., Llama-3 70b) still underperform.