Dexterous manipulation is limited by the cost of collecting large-scale robot demonstrations. Egocentric human videos offer a scalable source of diverse manipulation behaviors, but using them directly for robot learning requires bridging two gaps: the visual gap between human and robot observations, and the action gap between human motion and robot-executable actions. We propose EgoEngine, a scalable framework for transforming egocentric human manipulation videos into high-fidelity robot data. Given an egocentric RGB video, EgoEngine produces: (i) a high-fidelity robot observation video in which the human is replaced by a robot while scene context and temporal alignment are preserved, and (ii) a task-aligned robot action trajectory that is executable under feasibility constraints. Experiments in simulation and on real robots show that EgoEngine enables scalable conversion of human videos into robot data and, to our knowledge, provides the first demonstration of zero-shot visuomotor dexterous policy learning from egocentric human videos without real-robot demonstrations. Project website: https://egoengine.github.io.