Recent vision-language-action (VLA) models built upon pretrained vision-language models (VLMs) have achieved significant improvements in robotic manipulation. However, current VLAs still suffer from low sample efficiency and limited generalization. This paper argues that these limitations are closely tied to an overlooked component, the pretrained visual representation, which provides insufficient knowledge for both environment understanding and policy priors. Through an in-depth analysis, we find that the visual representations commonly used in VLAs, whether pretrained via language-image contrastive learning or image-based self-supervised learning, remain inadequate at capturing crucial, task-relevant environment information and at inducing effective policy priors, i.e., anticipatory knowledge of how the environment evolves under successful task execution. In contrast, we discover that predictive embeddings pretrained on videos, in particular V-JEPA 2, are adept at flexibly discarding unpredictable environment factors and encoding task-relevant temporal dynamics, thereby effectively compensating for key shortcomings of the visual representations in existing VLAs. Building on these observations, we introduce JEPA-VLA, a simple yet effective approach that adaptively integrates predictive embeddings into existing VLAs. Our experiments demonstrate that JEPA-VLA yields substantial performance gains across a range of benchmarks, including LIBERO, LIBERO-plus, RoboTwin2.0, and real-robot tasks.