Imitation-learning-based visuomotor policies have achieved strong performance in robotic manipulation, yet they remain sensitive to egocentric viewpoint shifts. Unlike third-person viewpoint changes, which only move the camera, egocentric shifts simultaneously alter both the camera pose and the robot's action coordinate frame, making it necessary to jointly transfer action trajectories and synthesize the corresponding observations under novel egocentric viewpoints. To address this challenge, we present EgoDemoGen, a framework that generates paired observation--action demonstrations under novel egocentric viewpoints through two key components: 1) EgoTrajTransfer, which transfers robot trajectories to the novel egocentric coordinate frame through motion-skill segmentation, geometry-aware transformation, and inverse-kinematics filtering; and 2) EgoViewTransfer, a conditional video generation model that fuses a novel-viewpoint reprojected scene video with a robot-motion video rendered from the transferred trajectory to synthesize photorealistic observations, trained with a self-supervised double-reprojection strategy that requires no multi-viewpoint data. Experiments in simulation and on a real robot show that EgoDemoGen consistently improves policy success rates under both standard and novel egocentric viewpoints, with absolute gains of +24.6\% and +16.9\% in simulation and +16.0\% and +23.0\% on the real robot. Moreover, EgoViewTransfer achieves superior video generation quality for novel egocentric observations.
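To make the coupling between camera pose and action frame concrete, the following is a minimal worked equation for the change of frame that a trajectory-transfer step of this kind must perform; the notation is illustrative and not taken from the paper itself. Let $T^{W}_{C},\, T^{W}_{C'} \in \mathrm{SE}(3)$ denote the original and novel egocentric camera poses in a world frame $W$, and let $T^{C}_{E}$ be an end-effector action pose expressed in the original egocentric frame. The same physical pose expressed in the novel egocentric frame is
\[
T^{C'}_{E} \;=\; \bigl(T^{W}_{C'}\bigr)^{-1}\, T^{W}_{C}\, T^{C}_{E},
\]
which shows why a third-person viewpoint change leaves the actions $T^{C}_{E}$ untouched, whereas an egocentric shift requires re-expressing every action through $\bigl(T^{W}_{C'}\bigr)^{-1} T^{W}_{C}$ before the corresponding observations can be synthesized.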