Multi-robot collaboration allows robots to efficiently take on a wide range of tasks, from moving a couch through a doorway to assembling structures on a construction site. However, achieving such coordination in mobile multi-robot settings remains challenging: centralized methods conditioned on the combined observations of a team scale poorly with team size, and decentralized methods that train one policy per robot often require explicit alignment procedures or information sharing at inference time to overcome partial observability. Our key insight is that the visuomotor priors of pretrained vision-language-action (VLA) models should enable reactive, decentralized collaboration from each robot's local observations alone, without these inference-time assumptions. We propose CHORUS, a framework that adapts a single VLA backbone to control diverse multi-robot teams. At inference time, each robot runs an independent copy of CHORUS, conditioned only on its own observations and a robot-identifying prompt. In real-world experiments including mobile tape measurement, library book handovers, and laundry basket lifting, CHORUS achieves a 64 percentage point improvement over decentralized, from-scratch models, improves reactivity to teammate behavior by 40 percentage points, and outperforms centralized baselines. Together, these results show that a shared VLA backbone can achieve decentralized multi-robot collaboration without per-robot policies or inter-robot communication at inference time.