Vision--Language--Action (VLA) policies have shown strong progress in mapping language instructions and visual observations to robotic actions, yet their reliability degrades in cluttered scenes with distractors. By analyzing failure cases, we find that many errors do not arise from infeasible motions, but from instance-level grounding failures: the policy often produces a plausible grasp trajectory that lands slightly off-target or even on the wrong object instance. To address this issue, we propose TAG (Target-Agnostic Guidance), a simple inference-time guidance mechanism that explicitly reduces distractor- and appearance-induced bias in VLA policies. Inspired by classifier-free guidance (CFG), TAG contrasts policy predictions under the original observation and an object-erased observation, and uses their difference as a residual steering signal that strengthens the influence of object evidence in the decision process. TAG does not require modifying the policy architecture and can be integrated with existing VLA policies with minimal training and inference changes. We evaluate TAG on standard manipulation benchmarks, including LIBERO, LIBERO-Plus, and VLABench, where it consistently improves robustness under clutter and reduces near-miss and wrong-object executions.
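To make the contrastive step concrete, the following is a minimal sketch of a CFG-style residual combination, not the exact TAG formulation. The abstract states only that the difference between the two predictions is used as a steering signal. The linear form `a + scale * (a - a_erased)`, the function name `guided_action`, and the `scale` parameter are illustrative assumptions borrowed from standard classifier-free guidance.

```python
import numpy as np


def guided_action(pred_original, pred_erased, scale=1.0):
    """Combine two policy predictions with a CFG-style residual.

    pred_original: action predicted from the original observation.
    pred_erased:   action predicted from the object-erased observation.
    scale:         hypothetical guidance weight; 0 returns pred_original unchanged.

    The linear extrapolation below is an assumed form, not taken from the source.
    """
    pred_original = np.asarray(pred_original, dtype=float)
    pred_erased = np.asarray(pred_erased, dtype=float)
    # Residual: the part of the prediction attributable to the object evidence.
    residual = pred_original - pred_erased
    # Push the prediction further along the residual direction.
    return pred_original + scale * residual


# Toy example: only the first action dimension depends on the object.
a_orig = [1.0, 2.0]
a_erased = [0.5, 2.0]
print(guided_action(a_orig, a_erased, scale=2.0))
```

In this toy case, only the dimension where the two predictions disagree is amplified, while the dimension unaffected by erasing the object stays as it was. That is the intended effect of strengthening object evidence in the decision.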