Vision-Language-Action (VLA) models have advanced robotic manipulation by combining vision, language, and proprioception to predict actions. However, previous methods fuse proprioceptive signals directly with VLM-encoded vision-language features, resulting in a state-dominant bias and false completions despite visible execution failures. We attribute this to modality imbalance, where policies over-rely on internal state while underusing visual evidence. To address this, we present ReViP, a novel VLA framework with Vision-Proprioception Rebalance that enhances visual grounding and robustness under perturbations. The key insight is to introduce auxiliary task-aware environment priors that adaptively modulate the coupling between semantic perception and proprioceptive dynamics. Specifically, we use an external VLM as a task-stage observer to extract real-time task-centric visual cues from visual observations, which drive a Vision-Proprioception Feature-wise Linear Modulation module to enhance environmental awareness and reduce state-driven errors. Moreover, to evaluate false completions, we propose the first False-Completion Benchmark Suite built on LIBERO with controlled settings such as Object-Drop. Extensive experiments show that ReViP effectively reduces false-completion rates and improves success rates over strong VLA baselines on our suite, with gains extending to LIBERO, RoboTwin 2.0, and real-world evaluations.
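To make the modulation mechanism concrete, the sketch below shows how Feature-wise Linear Modulation (FiLM) could condition proprioceptive features on a task-centric visual cue vector, as the abstract describes. This is a minimal NumPy illustration of generic FiLM, not the paper's actual architecture; the function and parameter names (`film_modulate`, `w_gamma`, etc.) and the identity initialization are assumptions for the example.

```python
import numpy as np

def film_modulate(proprio_feat, vision_cue, w_gamma, b_gamma, w_beta, b_beta):
    """Generic FiLM: a conditioning vector (here, a task-centric visual cue)
    predicts a per-channel scale (gamma) and shift (beta) that modulate the
    proprioceptive features, letting visual evidence gate internal state."""
    gamma = vision_cue @ w_gamma + b_gamma  # (B, D) per-channel scale
    beta = vision_cue @ w_beta + b_beta     # (B, D) per-channel shift
    return gamma * proprio_feat + beta

# Toy dimensions: a cue of size C modulates proprio features of size D.
B, C, D = 2, 4, 8
rng = np.random.default_rng(0)
proprio = rng.standard_normal((B, D))
cue = rng.standard_normal((B, C))

# Identity initialization (zero weights, gamma=1, beta=0): the module
# starts as a pass-through, so modulation is learned, not imposed.
w_g = np.zeros((C, D)); b_g = np.ones(D)
w_b = np.zeros((C, D)); b_b = np.zeros(D)
out = film_modulate(proprio, cue, w_g, b_g, w_b, b_b)
```

In practice the scale/shift generator would be a learned network over VLM-extracted cues; the identity initialization shown here is a common FiLM choice that preserves the unmodulated policy at the start of training.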