Contrastive Action-Image Pre-training for Visuomotor Control

Yuvan Sharma, Dantong Niu, Anirudh Pai, Zekai Wang, Zhuoyang Liu, Baifeng Shi, Stefano Saravalle, Boning Shao, Ruijie Zheng, Jing Wang, Konstantinos Kallidromitis, Yusuke Kato, Fabio Galasso, Yuke Zhu, Danfei Xu, Linxi "Jim" Fan, Jitendra Malik, Trevor Darrell, Roei Herzig

Existing vision encoders for robotics face a fundamental bottleneck: robotic datasets lack the scale necessary for large-scale pre-training. Prior work circumvents this data scarcity by turning to internet-scale image and language data or egocentric human video. While models trained this way show promise, neither paradigm learns from paired vision and action data, which downstream visuomotor control policies require. Robot trajectories, the most direct source of this paired signal, are not available at pre-training scale, which motivates us to extract action signals from abundant human video instead. To this end, we introduce CAIP (Contrastive Action-Image Pre-training), a vision encoder that treats human hand poses from large-scale egocentric video as a proxy for end-effector actions. By extracting 3D hand keypoints, a representation that aligns naturally with downstream robot action spaces, CAIP learns a unified action-image representation through a contrastive objective. Leveraging 32,041 hours of egocentric human video and only 88 hours of robotic manipulation data, CAIP outperforms state-of-the-art vision encoders including DINOv2, SigLIP, MVP, and R3M. Evaluated on a challenging real-world dexterous manipulation setup using Dexmate Vega and Sharpa Wave hands, CAIP yields performance gains of more than 30% on tasks involving folding, pouring, and fine-grained manipulation. Our results show that contrastive action-centric pre-training offers a scalable path to robust visual representations better suited to physical interaction.
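The abstract does not spell out the exact form of the contrastive objective. As a rough illustration only, the sketch below assumes a CLIP-style symmetric InfoNCE loss between image embeddings and action embeddings (for example, encoded 3D hand keypoints), where the i-th image and the i-th action in a batch form the positive pair. The function names, the temperature of 0.07, and the toy 2-D embeddings are all hypothetical and are not taken from the paper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_info_nce(image_emb, action_emb, temperature=0.07):
    """Assumed CLIP-style loss: matched (image, action) pairs share a batch index."""
    img = [l2_normalize(v) for v in image_emb]
    act = [l2_normalize(v) for v in action_emb]
    # Cosine similarity between every image and every action, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, j)) / temperature for j in act]
              for i in img]
    logits_t = [list(col) for col in zip(*logits)]  # action-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch of two pairs (hypothetical embeddings, not real data).
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_info_nce(images, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = symmetric_info_nce(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < shuffled_loss)
```

In this toy batch the loss is close to zero when each image embedding points the same way as its paired action embedding, and large when the pairs are swapped, which is the behaviour a contrastive action-image objective of this kind is meant to reward.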
