Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for robotic manipulation, in which reliable action prediction critically depends on accurately interpreting and integrating visual observations conditioned on language instructions. Although recent works have sought to enhance the visual capabilities of VLA models, most approaches treat the LLM backbone as a black box, providing limited insight into how visual information is grounded in action generation. To address this gap, we perform a systematic analysis of multiple VLA models across different action-generation paradigms and observe that sensitivity to visual tokens progressively decreases in deeper layers during action generation. Motivated by this observation, we propose \textbf{DeepVision-VLA}, built on a \textbf{Vision-Language Mixture-of-Transformers (VL-MoT)} framework. This framework enables shared attention between the vision foundation model and the VLA backbone, injecting multi-level visual features from the vision expert into the deeper layers of the VLA backbone to enhance visual representations for precise and complex manipulation. In addition, we introduce \textbf{Action-Guided Visual Pruning (AGVP)}, which leverages shallow-layer attention to prune irrelevant visual tokens while preserving task-relevant ones, reinforcing critical visual cues for manipulation with minimal computational overhead. DeepVision-VLA outperforms prior state-of-the-art methods by 9.0\% and 7.5\% on simulated and real-world tasks, respectively, providing new insights for the design of visually enhanced VLA models.
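To make the attention-guided pruning idea concrete, below is a minimal PyTorch-style sketch of scoring visual tokens by shallow-layer attention and keeping only the top-scoring ones. The function name, tensor shapes, and the `keep_ratio` default are illustrative assumptions for exposition, not the paper's actual AGVP implementation.

```python
import torch

def prune_visual_tokens(visual_tokens, attn_weights, keep_ratio=0.5):
    """Attention-guided visual token pruning (illustrative sketch).

    visual_tokens: (B, N, D) visual token embeddings
    attn_weights:  (B, H, Q, N) shallow-layer attention from action/language
                   queries to the N visual tokens
    keep_ratio:    fraction of visual tokens to retain (hypothetical default)
    """
    # Aggregate attention over heads and query positions to score each visual token
    scores = attn_weights.mean(dim=(1, 2))                 # (B, N)
    num_keep = max(1, int(visual_tokens.size(1) * keep_ratio))
    # Keep the highest-scoring (task-relevant) tokens, drop the rest
    keep_idx = scores.topk(num_keep, dim=-1).indices       # (B, num_keep)
    keep_idx, _ = keep_idx.sort(dim=-1)                    # preserve original token order
    batch_idx = torch.arange(visual_tokens.size(0)).unsqueeze(-1)
    return visual_tokens[batch_idx, keep_idx]              # (B, num_keep, D)
```

In this sketch the per-token score is simply the mean attention mass received from all heads and query positions in a shallow layer; retaining only the top-`keep_ratio` tokens is one straightforward way to reinforce task-relevant visual cues while reducing the token count passed to deeper layers.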