Ego to World: Collaborative Spatial Reasoning in Embodied Systems via Reinforcement Learning

Heng Zhou, Li Kang, Yiran Qin, Xiufeng Song, Ao Yu, Zilu Zhang, Haoming Song, Kaixin Xu, Yuchen Fan, Dongzhan Zhou, Xiaohong Liu, Ruimao Zhang, Philip Torr, Lei Bai, Zhenfei Yin

Understanding the world from distributed, partial viewpoints is a fundamental challenge for embodied multi-agent systems. Each agent perceives the environment through an ego-centric view that is often limited by occlusion and ambiguity. To study this problem, we introduce the Ego-to-World (E2W) benchmark, which evaluates a vision-language model's ability to fuse heterogeneous viewpoints across three tasks: (i) global counting, (ii) relational location reasoning, and (iii) action-oriented grasping that requires predicting view-specific image coordinates. To address this setting, we propose CoRL, a two-stage framework that combines Chain-of-Thought supervised fine-tuning with reinforcement learning using Group-Relative Policy Optimization. Its core component, the Cross-View Spatial Reward (CVSR), provides dense task-aligned feedback by linking reasoning steps to visual evidence, ensuring coherent cross-view entity resolution, and guiding the model toward correct final predictions. Experiments on E2W show that CoRL consistently surpasses strong proprietary and open-source baselines on both reasoning and perception-grounding metrics, while ablations further confirm the necessity of each CVSR component. Beyond that, CoRL generalizes to external spatial reasoning benchmarks and enables effective real-world multi-robot manipulation with calibrated multi-camera rigs, demonstrating cross-view localization and successful grasp-and-place execution. Together, E2W and CoRL provide a principled foundation for learning world-centric scene understanding from distributed, ego-centric observations, advancing collaborative embodied AI.
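The abstract gives no formulas, so the following is only a minimal sketch of the group-relative step that Group-Relative Policy Optimization is generally described as using: several responses are sampled for the same prompt, each is scored, and each score is normalized against the mean and standard deviation of its own group. The function names, the three reward arguments (`grounding`, `consistency`, `answer`), and the equal default weights are assumptions made for illustration; they mirror the three roles the abstract attributes to CVSR but are not the authors' definition.

```python
from statistics import mean, pstdev


def combined_reward(grounding, consistency, answer, weights=(1.0, 1.0, 1.0)):
    """Hypothetical weighted sum of three reward terms.

    The three terms stand in for the roles the abstract lists for CVSR
    (evidence grounding, cross-view consistency, final-answer correctness).
    The weighting scheme is an assumption, not taken from the paper.
    """
    w_g, w_c, w_a = weights
    return w_g * grounding + w_c * consistency + w_a * answer


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Every reward is centered on the group mean and divided by the group
    standard deviation; eps avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, each scored with the hypothetical reward.
group = [
    combined_reward(1.0, 1.0, 1.0),
    combined_reward(0.0, 0.0, 0.0),
    combined_reward(1.0, 0.5, 0.0),
    combined_reward(0.5, 1.0, 0.0),
]
advantages = group_relative_advantages(group)
```

Responses that score above their group's average receive a positive advantage and those below receive a negative one, so the policy update depends on relative quality within the group rather than on the absolute reward scale.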

