Contrastive Representation Regularization for Vision-Language-Action Models

Vision-Language-Action (VLA) models have shown their capabilities in robot manipulation by leveraging rich representations from pre-trained Vision-Language Models (VLMs). However, their representations arguably remain suboptimal, lacking sensitivity to robotic signals such as control actions and proprioceptive states. To address this issue, we introduce Robot State-aware Contrastive Loss (RS-CL), a simple and effective representation regularization for VLA models, designed to bridge the gap between VLM representations and robotic signals. In particular, RS-CL aligns the representations more closely with the robot's proprioceptive states by using the relative distances between states as soft supervision. Complementing the original action prediction objective, RS-CL enhances control-relevant representation learning while remaining lightweight and fully compatible with standard VLA training pipelines. Our empirical results demonstrate that RS-CL substantially improves the manipulation performance of state-of-the-art VLA models: it pushes the prior art from 30.8% to 41.5% on pick-and-place tasks in RoboCasa-Kitchen through more accurate positioning during grasping and placing, and boosts success rates from 45.0% to 58.3% on challenging real-robot manipulation tasks.
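
To make the idea of "relative distances between the states as soft supervision" concrete, the following is a minimal NumPy sketch of one way such a loss could look. It is not the formulation from the paper, which the abstract does not spell out. The function name `soft_contrastive_loss`, the temperatures `tau` and `beta`, the use of cosine similarity and Euclidean state distance, and the softmax-based soft targets are all assumptions made for illustration.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def soft_contrastive_loss(reps, states, tau=0.1, beta=1.0):
    """Hypothetical state-aware contrastive loss (illustrative only).

    reps:   (B, D) array of model representations, one per sample.
    states: (B, S) array of proprioceptive states, one per sample.
    tau:    assumed temperature for representation similarities.
    beta:   assumed temperature for state distances.
    """
    batch = reps.shape[0]
    self_pairs = np.eye(batch, dtype=bool)

    # Cosine similarity between every pair of representations.
    z = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    logits = z @ z.T / tau

    # Pairwise Euclidean distance between proprioceptive states.
    dist = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)

    # Exclude self-pairs from both the predictions and the targets.
    logits = np.where(self_pairs, -np.inf, logits)
    target_logits = np.where(self_pairs, -np.inf, -dist / beta)

    # Soft targets: samples with closer states receive larger weights.
    weights = np.exp(log_softmax(target_logits, axis=1))
    log_probs = log_softmax(logits, axis=1)
    log_probs = np.where(self_pairs, 0.0, log_probs)  # avoid 0 * (-inf)

    # Cross-entropy between soft targets and similarity distribution.
    return float(-(weights * log_probs).sum(axis=1).mean())


# Four samples forming two clusters in state space.
states = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
# Representations whose similarity structure matches the state clusters.
aligned = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
# Representations whose similarity structure contradicts the state clusters.
misaligned = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.01], [0.01, 1.0]])

loss_aligned = soft_contrastive_loss(aligned, states)
loss_misaligned = soft_contrastive_loss(misaligned, states)
```

In this toy setup the loss is smaller when samples with nearby states also have similar representations. Under the same assumptions, such a term would be added to the action prediction loss with a weighting coefficient (hypothetical here) rather than replacing it, matching the abstract's description of RS-CL as a complement to the original objective.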
