Recent work, spanning from autonomous vehicle coordination to in-space assembly, has shown the importance of learning collaborative behavior for enabling robots to achieve shared goals. A common approach for learning this cooperative behavior is to utilize the centralized-training decentralized-execution paradigm. However, this approach also introduces a new challenge: how do we evaluate the contributions of each agent's actions to the overall success or failure of the team? This credit assignment problem has remained open despite extensive study in the Multi-Agent Reinforcement Learning literature. In fact, humans manually inspecting agent behavior often generate better credit evaluations than existing methods. We combine this observation with recent work showing that Large Language Models demonstrate human-level performance at many pattern recognition tasks. Our key idea is to reformulate credit assignment as the two pattern recognition problems of sequence improvement and attribution, which motivates our novel LLM-MCA method. Our approach utilizes a centralized LLM reward-critic which numerically decomposes the environment reward based on the individualized contribution of each agent in the scenario. We then update the agents' policy networks based on this feedback. We also propose an extension, LLM-TACA, where our LLM critic performs explicit task assignment by passing an intermediary goal directly to each agent policy in the scenario. Both of our methods far outperform the state of the art on a variety of benchmarks, including Level-Based Foraging, Robotic Warehouse, and our new Spaceworld benchmark, which incorporates collision-related safety constraints. As an artifact of our methods, we also generate large trajectory datasets in which every timestep is annotated with per-agent reward information, as sampled from our LLM critics.
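To make the reward-decomposition idea concrete, the following is a minimal sketch of how a centralized LLM reward-critic might attribute a shared team reward to individual agents each timestep. All names here are illustrative assumptions (the paper does not specify this interface), and a uniform-split stub stands in for the actual language-model call:

```python
# Hypothetical sketch of LLM-based per-agent credit assignment.
# `mock_llm_critic` is a stand-in for a real language-model query;
# the prompt format and function names are assumptions, not the
# paper's actual implementation.

def format_critic_prompt(observations, actions, team_reward):
    """Build a text prompt asking the critic to attribute the team reward."""
    lines = [f"Team reward this step: {team_reward}"]
    for i, (obs, act) in enumerate(zip(observations, actions)):
        lines.append(f"Agent {i}: observed {obs!r}, took action {act!r}")
    lines.append("Return one numeric credit per agent, summing to the team reward.")
    return "\n".join(lines)

def mock_llm_critic(prompt, num_agents):
    """Stand-in for the LLM call: splits credit uniformly for illustration."""
    team_reward = float(prompt.splitlines()[0].split(": ")[1])
    return [team_reward / num_agents] * num_agents

def assign_credit(observations, actions, team_reward, critic=mock_llm_critic):
    """Decompose a shared reward into per-agent credits via the critic."""
    prompt = format_critic_prompt(observations, actions, team_reward)
    credits = critic(prompt, len(observations))
    # Sanity check: credits should reconstruct the environment reward.
    assert abs(sum(credits) - team_reward) < 1e-6
    return credits

credits = assign_credit(["near goal", "idle"], ["move", "wait"], 1.0)
```

In practice the critic's per-agent credits would replace the shared scalar reward in each agent's policy-gradient update, and swapping in a real LLM only requires replacing `mock_llm_critic`.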