Self-evolution is a central research topic for enabling large language model (LLM)-based agents to continually improve their capabilities after pretraining. Recent research has witnessed a transition from reinforcement learning (RL)-free to RL-based methods. Current RL-based methods either rely on dense external reward signals or extract intrinsic reward signals from LLMs themselves. However, these approaches diverge from the self-evolution mechanisms observed in human intelligence, where individuals learn and improve through mutual discussion and collaboration. In this work, we introduce Co-Evolving Multi-Agent Systems (CoMAS), a novel framework that enables agents to improve autonomously by learning from inter-agent interactions, without external supervision. CoMAS generates intrinsic rewards from rich discussion dynamics, employs an LLM-as-a-judge mechanism to formulate these rewards, and optimizes each agent's policy through RL, thereby enabling decentralized and scalable co-evolution. Experimental results demonstrate that CoMAS consistently outperforms untrained agents and achieves state-of-the-art performance across most evaluation settings. Ablation studies confirm the necessity of interaction-based reward signals and reveal promising scalability as the number and diversity of agents increase. These findings establish CoMAS as a novel and effective paradigm for self-evolution in LLM-based agents.
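To make the described loop concrete, the following is a minimal Python sketch of one CoMAS-style co-evolution round under stated assumptions: the `Agent` class, `speak`, `rl_update`, and `judge` names are hypothetical placeholders (the LLM calls and the exact RL update rule are stubbed), and the sketch illustrates only the structure the abstract describes, not the paper's actual implementation.

```python
# A minimal sketch of a CoMAS-style co-evolution loop. All interfaces here
# (Agent, speak, rl_update, judge) are hypothetical; the paper's real LLM
# calls and RL algorithm are replaced by runnable stubs.
from dataclasses import dataclass, field
import random


@dataclass
class Agent:
    name: str
    history: list = field(default_factory=list)  # (utterance, reward) pairs

    def speak(self, task: str, transcript: list[str]) -> str:
        # Placeholder for an LLM call conditioned on the task and the
        # discussion so far.
        return f"{self.name}: proposed step for '{task}'"

    def rl_update(self, utterance: str, reward: float) -> None:
        # Placeholder for a policy-optimization step on (utterance, reward).
        # In CoMAS each agent optimizes its own policy, so updates stay
        # decentralized.
        self.history.append((utterance, reward))


def judge(task: str, transcript: list[str], utterance: str) -> float:
    # Placeholder for the LLM-as-a-judge: score how much this contribution
    # advances the discussion. Random here purely to keep the sketch runnable.
    return random.uniform(0.0, 1.0)


def co_evolve(agents: list[Agent], tasks: list[str], rounds: int = 2) -> None:
    for task in tasks:
        transcript: list[str] = []
        for _ in range(rounds):  # multi-round discussion on one task
            for agent in agents:
                utterance = agent.speak(task, transcript)
                transcript.append(utterance)
                # The intrinsic reward comes from the interaction itself,
                # not from any external ground-truth supervision.
                reward = judge(task, transcript, utterance)
                agent.rl_update(utterance, reward)


if __name__ == "__main__":
    co_evolve([Agent("A"), Agent("B"), Agent("C")], ["solve x^2 = 4"])
```

Because rewards are computed per utterance from the discussion alone, adding more (or more diverse) agents only enlarges the loop over `agents`, which is consistent with the scalability behavior reported in the ablations.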