Recent advances in Vision-Language-Action (VLA) models have shown promise for robot control, but their dependence on action supervision limits scalability and generalization. To address this challenge, we introduce CARE, a framework for training VLA models for robotic task execution. Unlike existing methods that depend on action annotations during pretraining, CARE eliminates the need for explicit action labels by leveraging only video-text pairs. These weakly aligned data sources enable the model to learn continuous latent action representations through a newly designed multi-task pretraining objective. During fine-tuning, a small set of labeled data is used to train the action head for control. Experimental results across diverse simulation tasks demonstrate CARE's superior success rates, semantic interpretability, and resistance to shortcut learning, underscoring its scalability and effectiveness for robotic control under weak supervision.
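The two-stage recipe described above can be sketched in code. The following is a minimal, illustrative sketch only, assuming a simple MLP encoder and a linear action head; all class names, dimensions, and architectural details here are hypothetical and do not reflect CARE's actual design. Stage 1 pretrains an encoder that maps observation pairs and text embeddings to a continuous latent action without any action labels; stage 2 fine-tunes a small action head on a few labeled trajectories.

```python
import torch
import torch.nn as nn

class LatentActionEncoder(nn.Module):
    """Stage 1 (hypothetical sketch): pretrained on video-text pairs only.
    Maps consecutive frame features plus a text embedding to a continuous
    latent action, with no ground-truth action labels."""
    def __init__(self, obs_dim=32, text_dim=16, latent_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim + text_dim, 64),
            nn.ReLU(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, frame_t, frame_next, text):
        # Infer the latent action that explains the transition
        # from frame_t to frame_next under the language instruction.
        return self.net(torch.cat([frame_t, frame_next, text], dim=-1))

class ActionHead(nn.Module):
    """Stage 2 (hypothetical sketch): a small head fine-tuned on a limited
    set of labeled data, decoding latent actions into executable controls."""
    def __init__(self, latent_dim=8, action_dim=7):
        super().__init__()
        self.proj = nn.Linear(latent_dim, action_dim)

    def forward(self, z):
        return self.proj(z)

# Toy usage: batch of 4 transitions, 7-DoF action output.
encoder = LatentActionEncoder()
head = ActionHead()
z = encoder(torch.randn(4, 32), torch.randn(4, 32), torch.randn(4, 16))
actions = head(z)
print(tuple(actions.shape))  # (4, 7)
```

In fine-tuning, the pretrained encoder would typically be kept (frozen or lightly tuned) while only the action head is supervised with the small labeled set, which is what allows the bulk of training to rely on unlabeled video-text data.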