Generalist Vision-Language-Action (VLA) models remain constrained by the scarcity of robotic data relative to the abundance of human video demonstrations. Existing Latent Action Models attempt to use video data but often suffer from visual entanglement, encoding visual noise rather than manipulation skills. To address this limitation, we propose Contrastive Latent Action Pretraining (CLAP), a framework that first uses Act-VAE to learn an executable action-token vocabulary from robot trajectories and then aligns human visual transitions with this vocabulary through contrastive learning. This alignment maps unlabeled human videos into a physically grounded latent action space rather than reconstructing appearance. Building on the aligned tokens, we train CLAP-NTP as an autoregressive VLA using robot demonstrations and pseudo-labeled human videos, preserving instruction following and object generalization. For deployment and target-domain adaptation, we further introduce a post-training strategy that combines CLAP-RF, a Rectified Flow action head for low-latency continuous action chunk prediction, with Knowledge Matching regularization to preserve pretrained semantic knowledge during fine-tuning. Extensive experiments show that CLAP achieves strong performance against competitive baselines while enabling effective skill transfer from human videos to robotic execution.
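To make the alignment step concrete, the following is a minimal sketch of one common contrastive objective (InfoNCE) together with nearest-token pseudo-labeling. The abstract does not specify the loss form, the similarity measure, or the temperature, so all of these are assumptions, and the names `info_nce` and `nearest_token` are hypothetical helpers rather than part of CLAP. Embeddings are plain lists of floats; in practice they would come from a visual-transition encoder and from the learned action-token vocabulary.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(_normalize(a), _normalize(b)))


def info_nce(visual, actions, temperature=0.1):
    """Mean InfoNCE loss where visual[i] is the positive pair of actions[i].

    Assumption: cosine similarity and a fixed temperature; every other
    action embedding in the batch serves as a negative.
    """
    total = 0.0
    for i, v in enumerate(visual):
        logits = [_cosine(v, a) / temperature for a in actions]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(visual)


def nearest_token(embedding, codebook):
    """Pseudo-label a visual transition with the index of the most similar token."""
    similarities = [_cosine(embedding, token) for token in codebook]
    return similarities.index(max(similarities))
```

Under these assumptions, correctly paired embeddings yield a loss near zero and mismatched pairs a large loss, and `nearest_token` stands in for how an unlabeled human video transition could be assigned a token once the two spaces are aligned.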