Existing Vision-Language-Action (VLA) models predominantly rely on explicit Chain-of-Thought (CoT) reasoning to bridge perception and action. While effective, this paradigm incurs high computational cost and propagates errors across the steps of multi-step tasks. In this paper, we propose Adaptive Variable Alignment VLA (AVA-VLA), a latent-reasoning VLA framework that models reasoning as a sequence of unobserved latent variables, removing the need for explicit text generation. Latent trajectories, however, are inherently susceptible to noise and to misalignment with downstream objectives. To address this, we introduce a reinforcement-learning-based denoising mechanism that treats latent-state generation as a sequential decision process and optimizes reasoning trajectories with task-level rewards. We further incorporate an early-exit strategy that adaptively terminates reasoning based on state confidence, enabling a dynamic trade-off between reasoning depth and efficiency. Extensive experiments on embodied decision-making benchmarks show that AVA-VLA achieves a 6x inference speedup over explicit CoT methods while attaining a 98.3% average success rate on LIBERO, improving both efficiency and long-horizon stability over full-reasoning baselines.
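As an illustration of the early-exit idea only, the following is a minimal sketch of a latent reasoning loop that stops once a confidence score passes a threshold. It is not the AVA-VLA implementation: the names `latent_reasoning_with_early_exit`, `step_fn`, `readout_fn`, `threshold`, and `max_steps` are hypothetical, and taking confidence to be the maximum softmax probability of a readout over the latent state is an assumption made here for concreteness.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def latent_reasoning_with_early_exit(
    state: Vector,
    step_fn: Callable[[Vector], Vector],     # hypothetical: one latent reasoning update
    readout_fn: Callable[[Vector], Vector],  # hypothetical: latent state -> logits
    threshold: float = 0.9,                  # assumed confidence threshold
    max_steps: int = 8,                      # assumed cap on reasoning depth
) -> Tuple[Vector, int]:
    """Refine a latent state until the readout is confident or the cap is hit.

    Returns the final latent state and the number of update steps used.
    """
    steps_used = 0
    for _ in range(max_steps):
        # Assumption: confidence is the largest softmax probability of the readout.
        confidence = max(softmax(readout_fn(state)))
        if confidence >= threshold:
            break  # early exit: the state is already confident enough
        state = step_fn(state)
        steps_used += 1
    return state, steps_used


# Toy stand-ins: each step doubles the first coordinate, and the readout
# uses the latent state directly as logits.
def double_first(state: Vector) -> Vector:
    return [state[0] * 2.0] + state[1:]


def identity_readout(state: Vector) -> Vector:
    return list(state)


if __name__ == "__main__":
    final_state, steps = latent_reasoning_with_early_exit(
        [0.5, 0.0], double_first, identity_readout, threshold=0.9
    )
    print(final_state, steps)
```

In this toy setup a lower `threshold` ends reasoning sooner and a higher one allows more latent steps up to `max_steps`, which is the depth versus efficiency trade-off the abstract describes; how the actual model computes state confidence is not specified in the text above.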