Vision-language-action models (VLAs) have shown potential in leveraging pretrained vision-language models and diverse robot demonstrations to learn generalizable sensorimotor control. While this paradigm effectively utilizes large-scale data from both robotic and non-robotic sources, current VLAs primarily focus on direct input--output mappings, lacking the intermediate reasoning steps crucial for complex manipulation tasks; as a result, existing VLAs have limited temporal planning and reasoning capabilities. In this paper, we introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into VLAs by autoregressively predicting future image frames as visual goals before generating a short action sequence to achieve those goals. Building on this method, we present CoT-VLA, a state-of-the-art 7B VLA that can understand and generate both visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% on real-world manipulation tasks and by 6% on simulation benchmarks. Project website: https://cot-vla.github.io/