Generalist Vision-Language-Action (VLA) models are currently hindered by the scarcity of robot data relative to the abundance of human video demonstrations. Existing Latent Action Models attempt to leverage video data but often suffer from visual entanglement, capturing visual noise rather than manipulation skills. To address this, we propose Contrastive Latent Action Pretraining (CLAP), a framework that aligns a visual latent space learned from videos with a proprioceptive latent space learned from robot trajectories. Through contrastive learning, CLAP maps video transitions onto a quantized, physically executable codebook. Building on this representation, we introduce a dual-formulation VLA framework: CLAP-NTP, an autoregressive model that excels at instruction following and object generalization, and CLAP-RF, a Rectified Flow-based policy designed for high-frequency, precise manipulation. We further propose a Knowledge Matching (KM) regularization strategy to mitigate catastrophic forgetting during fine-tuning. Extensive experiments demonstrate that CLAP significantly outperforms strong baselines, enabling effective transfer of skills from human videos to robotic execution. Project page: https://lin-shan.com/CLAP/.
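To make the core idea concrete, the sketch below illustrates one way the contrastive alignment described above could look in code: video-transition latents and robot-trajectory latents are quantized against a shared codebook and pulled together with an InfoNCE-style objective. This is a minimal illustration only; the encoder architectures, codebook size, loss weighting, and all function names here are assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of contrastive latent alignment with a shared quantized
# codebook. Shapes, names, and hyperparameters are illustrative assumptions.

def quantize(z, codebook):
    """Nearest-neighbour vector quantization of latents z (B, D) onto codebook (K, D)."""
    dists = torch.cdist(z, codebook)           # (B, K) pairwise distances
    idx = dists.argmin(dim=-1)                 # index of the nearest code per latent
    z_q = codebook[idx]                        # (B, D) quantized latents
    # straight-through estimator so gradients still reach the encoder
    return z + (z_q - z).detach(), idx

def contrastive_alignment_loss(z_video, z_proprio, temperature=0.07):
    """Symmetric InfoNCE: matching video/trajectory pairs in a batch are positives."""
    z_v = F.normalize(z_video, dim=-1)
    z_p = F.normalize(z_proprio, dim=-1)
    logits = z_v @ z_p.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(z_v.size(0), device=z_v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage with random stand-in features (real video/proprioceptive encoders assumed).
B, D, K = 32, 256, 512
codebook = torch.randn(K, D)
z_video = torch.randn(B, D)      # latents from a video-transition encoder
z_proprio = torch.randn(B, D)    # latents from a robot-trajectory encoder
zq_v, _ = quantize(z_video, codebook)
zq_p, _ = quantize(z_proprio, codebook)
loss = contrastive_alignment_loss(zq_v, zq_p)
```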