Understanding the world from distributed, partial viewpoints is a fundamental challenge for embodied multi-agent systems. Each agent perceives the environment through an ego-centric view that is often limited by occlusion and ambiguity. To study this problem, we introduce the Ego-to-World (E2W) benchmark, which evaluates a vision-language model's ability to fuse heterogeneous viewpoints across three tasks: (i) global counting, (ii) relational location reasoning, and (iii) action-oriented grasping that requires predicting view-specific image coordinates. To address this setting, we propose CoRL, a two-stage framework that combines Chain-of-Thought supervised fine-tuning with reinforcement learning using Group-Relative Policy Optimization. Its core component, the Cross-View Spatial Reward (CVSR), provides dense task-aligned feedback by linking reasoning steps to visual evidence, ensuring coherent cross-view entity resolution, and guiding the model toward correct final predictions. Experiments on E2W show that CoRL consistently surpasses strong proprietary and open-source baselines on both reasoning and perception-grounding metrics, while ablations further confirm the necessity of each CVSR component. Beyond that, CoRL generalizes to external spatial reasoning benchmarks and enables effective real-world multi-robot manipulation with calibrated multi-camera rigs, demonstrating cross-view localization and successful grasp-and-place execution. Together, E2W and CoRL provide a principled foundation for learning world-centric scene understanding from distributed, ego-centric observations, advancing collaborative embodied AI.
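The second stage described above relies on Group-Relative Policy Optimization, whose defining step is to score each sampled response relative to the other responses drawn for the same prompt rather than against a learned value function. The following is a minimal sketch of that normalization step only, not the authors' implementation. It assumes each response has already been assigned a single scalar reward (how the CVSR terms are combined into that scalar is not specified here), and the function name, the `eps` constant, and the use of the population standard deviation are illustrative choices.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar reward per response sampled for the same
    prompt. `eps` (an assumed constant) avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one, so the policy update favors the better answers within each group without requiring a separate critic.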