R2RDreamer: 3D-aware Data Augmentation for Spatially-generalized 2D Manipulation Policies

Spatial generalization is critical for imitation-learned manipulation policies, but achieving it typically requires scaling demonstrations across diverse object poses, robot configurations, and camera viewpoints. Augmenting data from a few source demonstrations offers a practical alternative to costly real-world collection. Simulation-based augmentation can create controllable variation, but it requires complex environment and object setup and may introduce a sim-to-real gap. Recent real-to-real methods avoid these issues by jointly editing 3D observations and action trajectories from real demonstrations, yet they still rely on strong 3D scene parsing and geometry completion, and they often produce observations tailored to 3D point-cloud policies rather than RGB-based 2D policies. We propose R2RDreamer, a real-to-real demonstration augmentation framework that preserves the geometric consistency of 3D action-observation editing while moving visual completion to 2D video space. Specifically, R2RDreamer first performs lightweight 3D augmentation by editing incomplete object point clouds and end-effector trajectories in a shared 3D frame; it then projects the edited scene into masked image-space control videos with occlusion-aware reasoning and uses a dense-control image-to-video model to complete temporally coherent RGB observations. Experiments on spatially shifted manipulation tasks with both 2D diffusion-style policies and vision-language-action policies show that R2RDreamer improves spatial generalization from limited source demonstrations, and further analyses validate the contributions of 3D editing, occlusion-aware projection, and video completion.
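
To make the two geometric steps in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "editing in a shared 3D frame" can be illustrated by one rigid transform applied to both the object points and the end-effector positions, and that "occlusion-aware" projection can be approximated by a per-pixel nearest-depth (z-buffer) test under a pinhole camera. The function names, the intrinsics `K`, the toy points, and the choice of the camera frame as the shared frame are all hypothetical.

```python
import numpy as np


def rigid_transform(points, R, t):
    """Apply x -> R @ x + t to an (N, 3) array of points."""
    return points @ R.T + t


def project_with_zbuffer(points, K, height, width):
    """Project camera-frame points with a pinhole model.

    Returns an (height, width) depth map holding the nearest depth per
    pixel; pixels that no point lands on stay at +inf.
    """
    depth = np.full((height, width), np.inf)
    z = points[:, 2]
    in_front = z > 1e-6                      # drop points behind the camera
    pts, z = points[in_front], z[in_front]
    uvw = pts @ K.T                          # homogeneous pixel coordinates
    u = np.round(uvw[:, 0] / z).astype(int)
    v = np.round(uvw[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    # Unbuffered minimum: the closest point wins when several hit one pixel.
    np.minimum.at(depth, (v[inside], u[inside]), z[inside])
    return depth


# Hypothetical toy scene, expressed directly in the camera frame (assumption).
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
object_points = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 2.0], [0.05, 0.0, 1.0]])
ee_positions = np.array([[0.0, -0.1, 0.9], [0.0, -0.05, 0.95]])

# Hypothetical spatial edit: rotate 30 degrees about z, then shift 0.1 along x.
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta), np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.1, 0.0, 0.0])

# The same transform moves the object and the end-effector positions,
# which keeps their relative geometry unchanged.
new_object = rigid_transform(object_points, R, t)
new_ee = rigid_transform(ee_positions, R, t)

depth = project_with_zbuffer(np.vstack([new_object, new_ee]), K, 48, 64)
mask = np.isfinite(depth)  # pixels covered by edited geometry
```

In this sketch, `mask` marks the pixels covered by the edited geometry. Handing those pixels to a video model as a control signal, and letting the model fill in the rest, is the part of the pipeline this sketch does not attempt. Applying one transform to the whole end-effector trajectory is also a simplification; a real pipeline would presumably treat object-relative and free-space motion segments differently.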
