Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks that require logical planning, and with precise manipulations that demand fine-grained spatial perception. Recent efforts have incorporated Chain-of-Thought (CoT) reasoning to give VLA models a "thinking before acting" capability. However, current CoT-based VLA models face two critical limitations: (1) they rely on isolated, single-modal CoT and therefore cannot capture low-level visual details and high-level logical planning at the same time; (2) step-by-step autoregressive decoding incurs high inference latency and compounds errors. To address these limitations, we propose DualCoT-VLA, a visual-linguistic CoT method for VLA models with a parallel reasoning mechanism. To achieve comprehensive multi-modal reasoning, our method integrates a visual CoT for low-level spatial understanding and a linguistic CoT for high-level task planning. Furthermore, to overcome the latency bottleneck, we introduce a parallel CoT mechanism that uses two sets of learnable query tokens, replacing autoregressive reasoning with single-step forward reasoning. Extensive experiments demonstrate that DualCoT-VLA achieves state-of-the-art performance on the LIBERO and RoboCasa GR1 benchmarks, as well as on real-world platforms.
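To make the latency argument concrete, the following toy sketch contrasts step-by-step autoregressive reasoning with a single forward pass over two sets of query tokens. It is an illustration under stated assumptions, not the DualCoT-VLA implementation: `ToyBackbone`, `autoregressive_reasoning`, `parallel_reasoning`, the token values, and the query counts are all hypothetical, and the "backbone" is a trivial mixing function that only counts forward passes. In a real model the query tokens would be learnable parameters.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


@dataclass
class ToyBackbone:
    """Hypothetical stand-in for a VLM backbone; it only counts forward passes."""
    calls: int = 0

    def forward(self, tokens: List[Vector]) -> List[Vector]:
        self.calls += 1
        dim = len(tokens[0])
        # Toy "mixing": add the mean of all tokens to every token.
        mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
        return [[t[d] + mean[d] for d in range(dim)] for t in tokens]


def autoregressive_reasoning(backbone: ToyBackbone,
                             context: List[Vector],
                             num_steps: int) -> List[Vector]:
    """Produce one reasoning token per forward pass and feed it back in."""
    seq = list(context)
    thoughts = []
    for _ in range(num_steps):
        hidden = backbone.forward(seq)
        next_token = hidden[-1]  # the last position yields the next token
        thoughts.append(next_token)
        seq.append(next_token)
    return thoughts


def parallel_reasoning(backbone: ToyBackbone,
                       context: List[Vector],
                       visual_queries: List[Vector],
                       linguistic_queries: List[Vector]
                       ) -> Tuple[List[Vector], List[Vector]]:
    """Append both query sets to the context and read them out in one pass."""
    seq = list(context) + list(visual_queries) + list(linguistic_queries)
    hidden = backbone.forward(seq)
    n_ctx, n_vis = len(context), len(visual_queries)
    visual_cot = hidden[n_ctx:n_ctx + n_vis]   # visual CoT slots
    linguistic_cot = hidden[n_ctx + n_vis:]    # linguistic CoT slots
    return visual_cot, linguistic_cot


# Usage: 8 reasoning tokens, generated sequentially vs. in parallel.
context = [[1.0, 0.0], [0.0, 1.0]]
visual_q = [[0.1, 0.1] for _ in range(4)]      # assumed: 4 visual queries
linguistic_q = [[0.2, 0.2] for _ in range(4)]  # assumed: 4 linguistic queries

ar_backbone = ToyBackbone()
ar_thoughts = autoregressive_reasoning(ar_backbone, context, num_steps=8)

par_backbone = ToyBackbone()
vis_cot, lang_cot = parallel_reasoning(par_backbone, context,
                                       visual_q, linguistic_q)

print("autoregressive forward passes:", ar_backbone.calls)
print("parallel forward passes:", par_backbone.calls)
```

In this toy setting the autoregressive loop needs one backbone call per reasoning token, while the query-token variant needs a single call regardless of how many queries are appended. It also never feeds its own outputs back as inputs, which is the route by which errors compound in step-by-step decoding.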