FocusVLA: Focused Visual Utilization for Vision-Language-Action Models

Vision-Language-Action (VLA) models improve action generation by conditioning policies on rich vision-language information. However, current autoregressive policies are constrained by three bottlenecks: (1) architectural bias drives models to overlook visual details, (2) an excessive number of visual tokens makes it difficult for attention to focus on the correct regions, and (3) task-irrelevant visual information introduces substantial noise. Together, these bottlenecks severely impair action quality. In this paper, we investigate how to effectively utilize different visual representations for action generation. To this end, we first empirically validate the above issues and show that VLA performance is limited primarily by how visual information is utilized, rather than by the quality of the visual representations. Based on these insights, we introduce FocusVLA, a novel paradigm that directs the model's attention to task-relevant visual regions to effectively bridge vision and action. Specifically, we first propose Modality Cascaded Attention to eliminate shortcut pathways, thereby compelling VLA models to rely on task-relevant visual details for action generation. Furthermore, we propose Focus Attention, which dynamically selects task-relevant visual patches to control the quantity of information while explicitly modulating their influence to suppress task-irrelevant noise. Extensive experiments on both simulated and real-world robotic benchmarks demonstrate that FocusVLA not only leverages visual details effectively to perform dexterous manipulation, but also substantially improves performance and accelerates convergence across a variety of tasks.
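
To make the idea of selecting and modulating visual patches concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes a single action query, scaled dot-product relevance scores, a fixed top-k selection, and a sigmoid gate; the function name `focus_attention_sketch` and all parameters are hypothetical, and the abstract does not specify any of these details.

```python
import numpy as np


def focus_attention_sketch(query, patch_keys, patch_values, k):
    """Hypothetical top-k patch attention for a single action query.

    query:        (d,)   action query vector
    patch_keys:   (n, d) one key per visual patch
    patch_values: (n, d) one value per visual patch
    k:            number of patches to keep (assumed fixed here)
    """
    d = query.shape[0]
    # Scaled dot-product relevance score for every visual patch.
    scores = patch_keys @ query / np.sqrt(d)          # shape (n,)
    # Keep only the k highest-scoring patches to limit information quantity.
    keep = np.argsort(scores)[-k:]
    kept_scores = scores[keep]
    # Softmax over the kept patches only (shifted for numerical stability).
    weights = np.exp(kept_scores - kept_scores.max())
    weights /= weights.sum()
    # Assumed sigmoid gate: scales down the influence of weakly relevant patches.
    gate = 1.0 / (1.0 + np.exp(-kept_scores))
    output = (weights * gate) @ patch_values[keep]    # shape (d,)
    return output, np.sort(keep)


# Example call with random features: 16 patches, 8-dimensional embeddings.
rng = np.random.default_rng(0)
out, kept = focus_attention_sketch(
    rng.normal(size=8), rng.normal(size=(16, 8)), rng.normal(size=(16, 8)), k=4
)
```

In this sketch, selection controls how many patches contribute, while the gate controls how strongly each selected patch contributes. These are two separate knobs, mirroring the two roles the abstract attributes to Focus Attention.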
