Humans can rearrange objects in cluttered environments using egocentric perception, navigating occlusions without global coordinates. Inspired by this capability, we study long-horizon multi-object non-prehensile rearrangement for mobile robots using a single egocentric camera. We introduce EgoPush, a policy learning framework that enables egocentric, perception-driven rearrangement without relying on explicit global state estimation, which often fails in dynamic scenes. EgoPush designs an object-centric latent space that encodes relative spatial relations among objects rather than absolute poses. This design enables a privileged reinforcement-learning (RL) teacher to jointly learn latent states and mobile actions from sparse keypoints; the teacher is then distilled into a purely visual student policy. To reduce the supervision gap between the omniscient teacher and the partially observed student, we restrict the teacher's observations to visually accessible cues. This induces active perception behaviors that are recoverable from the student's viewpoint. To address long-horizon credit assignment, we decompose rearrangement into stage-level subproblems using temporally decayed, stage-local completion rewards. Extensive simulation experiments demonstrate that EgoPush significantly outperforms end-to-end RL baselines in success rate, and ablation studies validate each design choice. We further demonstrate zero-shot sim-to-real transfer on a real-world mobile platform. Code and videos are available at https://ai4ce.github.io/EgoPush/.