Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation by leveraging the powerful perception abilities of Vision-Language Models (VLMs) to understand environments and directly output actions. However, by default, VLA models may overly attend to image tokens in task-irrelevant regions, which we refer to as 'distracting tokens'. This behavior can distract the model from generating the desired action tokens at each step, lowering task success rates. In this paper, we introduce a simple yet effective plug-and-play Distracting Token Pruning (DTP) framework that dynamically detects and prunes these distracting image tokens. By correcting the model's visual attention patterns, we aim to improve task success rates and to explore the performance upper bound of a model without altering its original architecture or adding extra inputs. Experiments on the SIMPLER benchmark (Li et al., 2024) show that our method consistently achieves relative improvements in task success rates across different types of novel VLA models, demonstrating its generalizability to transformer-based VLAs. Further analysis reveals a negative correlation between task success rate and the amount of attention allocated to task-irrelevant regions for all models tested, highlighting a common phenomenon of VLA models that could guide future research. Our code is available at: https://anonymous.4open.science/r/CBD3.