Large-scale pre-training on videos has proven effective for robot learning. However, models pre-trained on such data can be suboptimal for robot learning due to the significant visual gap between human hands and robot end-effectors. To remedy this, we propose H2R, a simple data augmentation technique that detects human hand keypoints, synthesizes corresponding robot motions in simulation, and composites the rendered robots into egocentric videos. This process explicitly bridges the visual gap between human and robot embodiments during pre-training. We apply H2R to large-scale egocentric human video datasets such as Ego4D and SSv2, replacing human hands with simulated robotic arms to generate robot-centric training data. Based on this, we construct and release a family of 1M-scale datasets covering multiple robot embodiments (UR5 with gripper/Leaphand, Franka) and data sources (SSv2, Ego4D). To verify the effectiveness of the augmentation pipeline, we introduce a CLIP-based image-text similarity metric that quantitatively evaluates the semantic fidelity of robot-rendered frames with respect to the original human actions. We validate H2R on three simulation benchmarks (Robomimic, RLBench, and PushT) and on real-world manipulation tasks with a UR5 robot equipped with gripper and Leaphand end-effectors. H2R consistently improves downstream success rates, yielding gains of 5.0%-10.2% in simulation and 6.7%-23.3% in real-world tasks across various visual encoders and policy-learning methods. These results indicate that H2R improves the generalization of robot policies by mitigating the visual discrepancy between human and robot domains.
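The CLIP-based evaluation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes image and caption embeddings have already been produced by a CLIP encoder, and the ratio-based `semantic_fidelity` normalization is an assumption of this sketch (the abstract only specifies an image-text similarity metric).

```python
import numpy as np

def clip_similarity(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between a CLIP image embedding and a CLIP text
    embedding; higher means the frame matches the action caption better."""
    a = np.asarray(image_emb, dtype=np.float64)
    b = np.asarray(text_emb, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_fidelity(robot_frame_emb: np.ndarray,
                      caption_emb: np.ndarray,
                      human_frame_emb: np.ndarray) -> float:
    """Illustrative fidelity score (an assumption, not the paper's formula):
    how much caption similarity the robot-rendered frame retains relative
    to the original human frame. Values near 1.0 suggest the compositing
    preserved the semantics of the action."""
    return (clip_similarity(robot_frame_emb, caption_emb)
            / clip_similarity(human_frame_emb, caption_emb))
```

In practice the embeddings would come from a pretrained CLIP model (e.g. via `open_clip` or Hugging Face `transformers`), encoding each rendered frame and the action description of the source clip.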