See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs

Generalization remains a central bottleneck for vision-language-action (VLA) models: under distractors, appearance shifts, and semantically similar tasks, the policy must often infer local execution details from coarse instructions while also deciding which parts of the image matter for control. We present S2 (See Less, Specify More), a framework that improves VLA generalization by training the executor under a cleaner interface. Specify More preserves the original instruction as a stable high-level goal while relabeling each trajectory with refined trajectory- and subtask-level language that disambiguates the current execution mode. See Less, unlike native attention, imposes an explicit visual evidence budget, training the executor to act from task-sufficient evidence rather than unconstrained visual context, without any region or mask annotation. This interface lets the executor follow detailed guidance without relying on distracting visual patches or resolving avoidable ambiguity on its own, and it remains compatible with off-the-shelf VLM planners through in-context learning. Across our main evaluation settings, S2 improves overall generalization metrics by changing the executor's learning problem. Three findings support this: coarse instructions induce avoidable supervision aliasing; goal-preserving local guidance outperforms instruction replacement in our main ablations; and explicit evidence budgeting reduces dependence on broad visual context, a benefit that goes beyond efficiency. Across eight real-robot tasks on TX-G2 (an AgiBot G2-compatible variant) and HSR, S2 raises mean subtask success from 54.2% with pi0.5 to 79.0%. Together, these results suggest that VLA generalization improves when the executor is trained to act from informative local guidance and task-sufficient visual evidence, rather than recovering both from weak supervision.
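
As a rough illustration of what an explicit visual evidence budget could look like, the sketch below keeps only the k highest-scoring image patches and masks out the rest. The abstract does not specify the mechanism, so the relevance scores, the function name `apply_evidence_budget`, and the hard top-k rule are all assumptions made for illustration and should not be read as the S2 implementation.

```python
from typing import List, Sequence, Tuple


def apply_evidence_budget(
    patch_scores: Sequence[float], budget: int
) -> Tuple[List[int], List[bool]]:
    """Keep the `budget` highest-scoring patches (hypothetical hard top-k rule).

    Returns the kept patch indices in ascending order and a per-patch keep mask.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # A budget larger than the number of patches simply keeps everything.
    budget = min(budget, len(patch_scores))
    # Rank patch indices by score, highest first; ties keep the lower index.
    ranked = sorted(range(len(patch_scores)), key=lambda i: (-patch_scores[i], i))
    kept = sorted(ranked[:budget])
    kept_set = set(kept)
    mask = [i in kept_set for i in range(len(patch_scores))]
    return kept, mask


# Toy example: six patches with made-up relevance scores and a budget of 2.
scores = [0.05, 0.40, 0.10, 0.30, 0.05, 0.10]
kept, mask = apply_evidence_budget(scores, budget=2)
print(kept)  # [1, 3]
print(mask)  # [False, True, False, True, False, False]
```

The sketch shows only the budget constraint itself: the executor would see at most `budget` patches regardless of image content. How the scores are produced and how the selection is trained without region or mask annotation are not covered here.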
