EgoInfinity: A Web-Scale 4D Hand-Object Interaction Data Engine for Any-View Robot Retargeting and Video-to-Action Robot Learning

Internet videos constitute the largest reservoir of embodied human manipulation knowledge, yet converting arbitrary RGB footage into actionable robot training data remains a major bottleneck. Existing lab- or factory-collected datasets are narrow in scale and diversity, limiting open-world robot learning. Instead of proposing a static dataset, we introduce EgoInfinity, a universal 4D hand-object interaction data engine that enables web-scale data generation for robot retargeting and learning. EgoInfinity is a modular engine integrating perception, segmentation, reconstruction, interaction-aware refinement, and retargeting to automate the traditionally unscalable video-to-action conversion without human-in-the-loop annotation. Its modular design lets the engine continuously benefit from advances in any incorporated component. With EgoInfinity, in-the-wild human manipulation videos are lifted into agent-agnostic, metric 4D hand-object representations, including hand trajectories, 6-DoF object poses, and contact-relevant states. Rather than naively connecting standalone components, EgoInfinity combines cross-module metric calibration with interaction-aware refinement to improve physical reliability, reducing the drift and contact inconsistencies common in purely visual reconstruction. We further propose a novel motion retargeter that compiles the recovered 3D hand motions into executable joint trajectories for diverse robot morphologies, enabling video-to-action retargeting on any robot from arbitrary viewpoints and shot sizes (e.g., when the human body is only partially visible). We validate EgoInfinity across perception fidelity, kinematic feasibility, contact consistency, cross-embodiment generalization, and real-robot skill acquisition (e.g., grasping, cutting, wiping, and pouring), demonstrating a scalable bridge from internet videos to executable robot behavior for open-world robot learning.
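As a rough illustration of the modular design described above, the following Python sketch shows one way a staged engine with swappable components could be organized. Only the five stage names come from the abstract; the `Engine` class, the `register` and `run` methods, the shared state dictionary, and the placeholder stages are all hypothetical and are not EgoInfinity's actual interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# A stage maps the shared state dict to an updated state dict (assumed design).
Stage = Callable[[Dict], Dict]

# Stage order taken from the abstract; everything else in this sketch is hypothetical.
STAGE_ORDER = (
    "perception",
    "segmentation",
    "reconstruction",
    "refinement",
    "retargeting",
)


@dataclass
class Engine:
    """Toy pipeline in which each stage can be replaced independently."""

    stages: Dict[str, Stage] = field(default_factory=dict)

    def register(self, name: str, stage: Stage) -> None:
        # Registering under an existing name swaps in the new component.
        if name not in STAGE_ORDER:
            raise ValueError(f"unknown stage: {name}")
        self.stages[name] = stage

    def run(self, video_path: str) -> Dict:
        missing = [n for n in STAGE_ORDER if n not in self.stages]
        if missing:
            raise RuntimeError(f"missing stages: {missing}")
        state: Dict = {"video": video_path, "trace": []}
        for name in STAGE_ORDER:
            state = self.stages[name](state)
            state["trace"].append(name)  # record which stages ran, in order
        return state


def placeholder(key: str) -> Stage:
    """Build a dummy stage that only writes a marker string under `key`."""

    def stage(state: Dict) -> Dict:
        new_state = dict(state)
        new_state[key] = f"<{key} output>"
        return new_state

    return stage


def build_demo_engine() -> Engine:
    engine = Engine()
    for name in STAGE_ORDER:
        engine.register(name, placeholder(name))
    return engine
```

Under these assumptions, upgrading a single component amounts to calling `register` again with the same stage name, which leaves the remaining stages untouched.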
