Large-scale pre-training on egocentric human videos has proven effective for robot learning. However, models pre-trained on such data can be suboptimal for downstream robot learning because of the significant visual gap between human hands and the end-effectors of different robots. To remedy this, we propose H2R, a human-to-robot data augmentation pipeline that converts egocentric human videos into robot-centric visual data. H2R estimates human hand poses from the videos, retargets the motion to simulated robotic arms, removes human limbs via segmentation and inpainting, and composites rendered robot embodiments into the original frames with camera-aligned geometry. This process explicitly bridges the visual gap between human and robot embodiments during pre-training. We apply H2R to augment large-scale egocentric human video datasets such as Ego4D and SSv2. To verify the effectiveness of the augmentation pipeline, we introduce a CLIP-based image-text similarity metric that quantitatively evaluates the semantic fidelity of robot-rendered frames to the original human actions. We evaluate H2R through comprehensive experiments in both simulation and real-world settings. In simulation, H2R consistently improves downstream success rates across four benchmark suites (Robomimic, RLBench, PushT, and CortexBench), yielding gains of 1.3%–10.2% across different visual encoders and policy learning methods. In real-world experiments, H2R improves performance on UR5 and dual-arm Franka/UR5 manipulation platforms, achieving success rate gains of 3.3%–23.3% across gripper-based, dexterous, and bimanual tasks. We further demonstrate the potential of H2R for cross-embodiment generalization and its compatibility with vision-language-action models. These results indicate that H2R improves the generalization ability of robotic policies by mitigating the visual discrepancies between the human and robot domains.
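
As a concrete illustration of the kind of check the CLIP-based image-text similarity metric performs, the sketch below scores a robot-rendered frame against a text description of the original human action. This is a minimal sketch assuming the Hugging Face transformers CLIP API and the openai/clip-vit-base-patch32 checkpoint; the model choice, file names, and scoring details are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of a CLIP image-text similarity check for a robot-rendered frame.
# Assumes the Hugging Face `transformers` CLIP API; model and scoring details
# here are illustrative, not the authors' exact evaluation protocol.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image_path: str, action_text: str) -> float:
    """Cosine similarity between a frame and a text description of the action."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[action_text], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    # Normalize embeddings so the dot product is a cosine similarity.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return float((image_emb @ text_emb.T).item())

# Hypothetical usage: a robot-rendered frame that preserves the semantics of the
# original human action should score high against the action's text description.
# score = clip_similarity("robot_rendered_frame.png", "a robot arm picking up a cup")
```

Comparing such scores between the original human frames and their robot-rendered counterparts gives a simple quantitative proxy for whether the compositing step preserved the action semantics.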