GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

Xiaosong Jia, Bowen Yang, Zuhao Ge, Xian Nie, Yuchen Zhou, Cunxin Fan, Yufeng Li, Yilin Chai, Chao Jing, Zijian Liang, Qingwen Bu, Haidong Cao, Chao Wu, Qifeng Li, Zhenjie Yang, Chenhe Zhang, Hongyang Li, Zuxuan Wu, Junchi Yan, Yu-Gang Jiang

From arXiv. Accepted to RSS 2026. Project page: https://guidedvla.github.io/project_page/

Vision-Language-Action (VLA) models aim for general robot learning by treating action as an additional modality aligned within powerful Vision-Language Models (VLMs). Existing VLAs rely on end-to-end supervision, leaving the action decoding process to learn task-relevant features only implicitly. Without explicit guidance, these models often overfit to spurious correlations, such as visual shortcuts or environmental noise, which limits their generalization. In this paper, we introduce GuidedVLA, a framework designed to manually guide action generation to focus on task-relevant factors. Our core insight is to treat the action decoder not as a monolithic learner, but as an assembly of functional components: individual attention heads are supervised by manually defined auxiliary signals so that each captures a distinct factor. As an initial study, we instantiate this paradigm with three specialized heads: object grounding, spatial geometry, and temporal skill logic. Across simulation and real-robot experiments, GuidedVLA improves success rates in both in-domain and out-of-domain settings compared to strong VLA baselines. Finally, we show that the quality of these specialized factors correlates positively with task performance and that our mechanism yields decoupled, high-quality features. Our results suggest that explicitly guiding action-decoder learning is a promising direction for building more robust and general VLA models.
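To make the idea of supervising an individual attention head concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the form of the auxiliary signals or losses, so the cross-entropy guidance term, the function names (`head_attention`, `guidance_loss`, `total_loss`), and the weight of 0.1 are all assumptions chosen for illustration. The sketch computes one head's attention over a few tokens and penalizes it when its attention mass falls outside a hand-defined target mask (for example, tokens covering the target object for an object-grounding head).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def head_attention(query, keys):
    """Scaled dot-product attention weights of ONE head for a single action query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

def guidance_loss(attn, target_mask, eps=1e-8):
    """Hypothetical auxiliary loss (an assumption, not taken from the paper):
    cross-entropy between the normalized target mask and the head's attention."""
    mass = sum(target_mask)
    target = [t / mass for t in target_mask]
    return -sum(t * math.log(a + eps) for t, a in zip(target, attn))

def total_loss(action_loss, aux_losses, weight=0.1):
    """End-to-end action loss plus weighted per-head guidance terms.
    The weight value is an illustrative assumption."""
    return action_loss + weight * sum(aux_losses)

# Toy example: three tokens, with the first one marked as task-relevant.
query = [1.0, 0.0]
keys = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
attn = head_attention(query, keys)

loss_on_attended = guidance_loss(attn, [1, 0, 0])  # mask matches where the head looks
loss_on_ignored = guidance_loss(attn, [0, 1, 0])   # mask points at a token the head ignores
```

In this toy setup the guidance term is smaller when the mask agrees with where the head already attends, so minimizing it alongside the action loss would push that head toward the designated factor while leaving the other heads unconstrained.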
