We study whether multiple large language models (LLMs) can autonomously improve each other in a negotiation game by playing, reflecting, and criticizing. We are interested in this question because if LLMs can improve each other, strong AI agents could be created with minimal human intervention. We ask two LLMs to negotiate with each other, playing the roles of a buyer and a seller, respectively. They aim to reach a deal, with the buyer targeting a lower price and the seller a higher one. A third language model, playing the critic, gives one player feedback for improving its negotiation strategy. We let the two agents play multiple rounds, using the previous negotiation history and AI feedback as in-context demonstrations to improve the model's negotiation strategy iteratively. We use different LLMs (GPT and Claude) for different roles and use the deal price as the evaluation metric. Our experiments reveal multiple intriguing findings: (1) Only a subset of the language models we consider can self-play and improve the deal price from AI feedback; weaker models either do not understand the game's rules or cannot incorporate AI feedback for further improvement. (2) A model's ability to learn from feedback depends on the role it plays. For example, it is harder for Claude-instant to improve as the buyer than as the seller. (3) When the game is unrolled over multiple rounds, stronger agents can consistently improve their performance by meaningfully using previous experiences and iterative AI feedback, yet have a higher risk of breaking the deal. We hope our work provides an initial exploration of having models autonomously improve each other with game playing and AI feedback.
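The play, critique, and replay loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `play_game` and `improve_with_feedback`, the numeric offer format, the acceptance rule, and the three stub agents are all hypothetical, and the stubs are deterministic functions standing in for real LLM calls. Only the buyer is improved here; the seller keeps a fixed policy.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces: an agent maps (in-context demonstrations,
# current dialogue) to a price offer; a critic maps a dialogue to feedback.
Agent = Callable[[List[str], List[str]], float]
Critic = Callable[[List[str]], str]


def play_game(buyer: Agent, seller: Agent, buyer_ctx: List[str],
              seller_ctx: List[str], max_turns: int = 8
              ) -> Tuple[List[str], Optional[float]]:
    """Play one negotiation; return the dialogue and the deal price (None if no deal)."""
    dialogue: List[str] = []
    for _ in range(max_turns):
        ask = seller(seller_ctx, dialogue)
        dialogue.append(f"seller: {ask}")
        bid = buyer(buyer_ctx, dialogue)
        dialogue.append(f"buyer: {bid}")
        if bid >= ask:  # assumed rule: the buyer accepts the seller's current ask
            return dialogue, ask
    return dialogue, None  # turn budget exhausted: the deal breaks


def improve_with_feedback(buyer: Agent, seller: Agent, critic: Critic,
                          rounds: int, max_turns: int = 8
                          ) -> List[Optional[float]]:
    """Replay the game, adding history and critic feedback to the buyer's context."""
    buyer_ctx: List[str] = []
    prices: List[Optional[float]] = []
    for _ in range(rounds):
        dialogue, price = play_game(buyer, seller, buyer_ctx, [], max_turns)
        prices.append(price)
        # Previous history and AI feedback become in-context demonstrations.
        buyer_ctx.append("history: " + " | ".join(dialogue))
        buyer_ctx.append("feedback: " + critic(dialogue))
    return prices


# Deterministic stubs standing in for LLM calls (assumptions for illustration).
def stub_seller(ctx: List[str], dialogue: List[str]) -> float:
    turns = sum(line.startswith("seller:") for line in dialogue)
    return 20.0 - turns  # lowers the ask by 1 each turn


def stub_buyer(ctx: List[str], dialogue: List[str]) -> float:
    feedback = sum(item.startswith("feedback:") for item in ctx)
    turns = sum(line.startswith("buyer:") for line in dialogue)
    return 10.0 - 2 * feedback + turns  # opens lower after each feedback


def stub_critic(dialogue: List[str]) -> str:
    return "open lower and concede more slowly"
```

With these stubs, `improve_with_feedback(stub_buyer, stub_seller, stub_critic, rounds=4)` returns `[15.0, 14.0, 13.0, None]`: the buyer's deal price drops each round until its more aggressive opening exhausts the turn budget and no deal is reached. This toy outcome only mirrors the shape of finding (3); it is not a result from the experiments.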