Vision-Language-Action (VLA) models excel at robotic manipulation but suffer from significant inference latency because they process dense visual tokens. Existing token reduction methods predominantly rely on attention magnitude as a static selection criterion. In this work, we challenge this assumption, revealing that high-attention tokens are task-dependent and can even degrade policy performance. To address this, we introduce \textbf{TIES} (\textbf{T}au-guided \textbf{I}nter-layer \textbf{E}fficient \textbf{S}election), a dynamic framework guided by inter-layer token ranking consistency. By adaptively balancing attention magnitude with ranking consistency, TIES ensures robust token selection without requiring additional training. On the CogACT + SIMPLER benchmark, TIES improves the average success rate by 6\% while reducing token usage by 78\%, and it demonstrates strong generalization across diverse decoders and benchmarks.
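The idea of balancing attention magnitude against inter-layer ranking consistency can be illustrated with a minimal sketch. This is not the TIES algorithm itself: the abstract does not give the scoring rule, so everything below is an assumption made for illustration. In particular, it assumes that "tau" refers to Kendall's rank correlation between the per-token attention scores of two layers, and the helper names \texttt{kendall\_tau} and \texttt{select\_tokens}, the blending weight, and the per-token stability score are all hypothetical.

```python
import numpy as np


def kendall_tau(a, b):
    """Kendall tau-a between two score vectors (O(n^2); tied pairs contribute 0)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = a.size
    if n < 2:
        return 1.0  # a single token is trivially consistent
    sign_a = np.sign(a[:, None] - a[None, :])
    sign_b = np.sign(b[:, None] - b[None, :])
    # The full matrix counts every unordered pair twice, hence n * (n - 1).
    return float((sign_a * sign_b).sum() / (n * (n - 1)))


def select_tokens(attn_prev, attn_curr, keep):
    """Hypothetical rule: blend attention rank with rank stability, weighted by tau.

    attn_prev, attn_curr: per-token attention scores from two layers (assumed inputs).
    keep: number of visual tokens to retain.
    Returns the sorted indices of the retained tokens.
    """
    attn_prev = np.asarray(attn_prev, dtype=float)
    attn_curr = np.asarray(attn_curr, dtype=float)
    n = attn_curr.size
    denom = max(n - 1, 1)

    # Rank of each token (0 = lowest attention) in each layer.
    rank_prev = np.argsort(np.argsort(attn_prev))
    rank_curr = np.argsort(np.argsort(attn_curr))

    # Map tau from [-1, 1] to a weight in [0, 1] (an assumed mapping).
    weight = (kendall_tau(attn_prev, attn_curr) + 1.0) / 2.0

    magnitude = rank_curr / denom                              # high attention -> near 1
    stability = 1.0 - np.abs(rank_curr - rank_prev) / denom    # stable rank -> near 1

    # Consistent layers (weight near 1): trust attention magnitude.
    # Inconsistent layers (weight near 0): prefer tokens whose rank is stable.
    score = weight * magnitude + (1.0 - weight) * stability
    kept = np.argsort(-score)[:keep]
    return sorted(int(i) for i in kept)
```

When the two layers rank the tokens identically, the weight becomes 1 and the sketch reduces to plain top-k selection by attention; as the rankings diverge, tokens with stable ranks are favored instead. No training is involved, which matches the training-free setting described above, but the actual criterion used by TIES may differ.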