Powerful large language models (LLMs) from different providers have been expensively trained and fine-tuned to specialize in varying domains. In this work, we introduce a new kind of Conductor model, trained with reinforcement learning to automatically discover powerful coordination strategies among LLMs. Our Conductor learns not only to design targeted communication topologies for effective agent-to-agent collaboration, but also to prompt-engineer focused instructions for the worker LLMs so as to maximally leverage their individual capabilities. We show that, by learning optimal coordination strategies over pools of powerful worker LLMs, a 7B Conductor achieves significant performance gains beyond any individual worker, attaining state-of-the-art results on challenging reasoning benchmarks such as LiveCodeBench and GPQA. By training with randomized agent pools, our Conductor adapts effectively to arbitrary sets of open- and closed-source agents, accommodating diverse user requirements. Furthermore, allowing the Conductor to select itself as a worker gives rise to recursive topologies, further elevating performance through a new form of dynamic test-time scaling via online iterative adaptation. More broadly, ours is among the first works to demonstrate that language-model coordination can be unlocked through RL, with powerful coordination strategies emerging naturally in LLMs through pure end-to-end reward maximization.
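To make the setting concrete, one can picture the Conductor's output as a small coordination plan: a set of worker calls wired into a communication topology, each carrying a tailored instruction, with recursion arising when a call names the Conductor itself. The sketch below is purely illustrative Python under that reading of the abstract; `CoordinationPlan`, `WorkerCall`, `run_plan`, and `invoke` are hypothetical names, not the paper's actual interface or training code.

```python
# A minimal sketch of a coordination plan a Conductor might emit, assuming a
# simple DAG topology with per-worker prompt-engineered instructions.
# All names here are hypothetical illustrations, not the paper's interface.
from dataclasses import dataclass, field

@dataclass
class WorkerCall:
    worker_id: str          # e.g. an open- or closed-source LLM, or "conductor" (recursive)
    instruction: str        # focused, prompt-engineered instruction for this worker
    inputs: list[str] = field(default_factory=list)  # ids of upstream calls feeding this one

@dataclass
class CoordinationPlan:
    calls: dict[str, WorkerCall]  # node id -> worker call
    final: str                    # node id whose output is returned as the answer

def run_plan(plan: CoordinationPlan, query: str, invoke) -> str:
    """Execute calls in topological order; `invoke(worker_id, prompt)` is the
    caller-supplied LLM API that returns a text response."""
    outputs: dict[str, str] = {}
    pending = dict(plan.calls)
    while pending:
        # Nodes whose upstream outputs are all available can run now.
        ready = [nid for nid, c in pending.items()
                 if all(i in outputs for i in c.inputs)]
        if not ready:
            raise ValueError("cycle in communication topology")
        for nid in ready:
            call = pending.pop(nid)
            context = "\n\n".join(outputs[i] for i in call.inputs)
            prompt = f"{call.instruction}\n\nTask: {query}\n\n{context}".strip()
            outputs[nid] = invoke(call.worker_id, prompt)
    return outputs[plan.final]
```

Under this framing, the RL objective described in the abstract would score the plan (topology plus instructions) purely by end-task reward, so useful coordination strategies emerge without hand-designed collaboration rules.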