Why do pretrained diffusion or flow-matching policies fail when the same task is performed near an obstacle, on a shifted support surface, or amid mild clutter? Such failures rarely reflect missing motor skills; instead, they expose a limitation of imitation learning under train-test shifts, where action generation is tightly coupled to training-specific spatial configurations and task specifications. Retraining or fine-tuning to address these failures is costly and conceptually misaligned, as the required behaviors already exist but cannot be selectively adapted at test time. We propose Vision-Language Steering (VLS), a training-free framework for inference-time adaptation of frozen generative robot policies. VLS treats adaptation as an inference-time control problem, steering the sampling process of a pretrained diffusion or flow-matching policy in response to out-of-distribution observation-language inputs without modifying policy parameters. By leveraging vision-language models to synthesize trajectory-differentiable reward functions, VLS guides denoising toward action trajectories that satisfy test-time spatial and task requirements. Across simulation and real-world evaluations, VLS consistently outperforms prior steering methods, achieving a 31% improvement on CALVIN and a 13% gain on LIBERO-PRO. Real-world deployment on a Franka robot further demonstrates robust inference-time adaptation under test-time spatial and semantic shifts. Project page: https://vision-language-steering.github.io/webpage/
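The core mechanism described above, steering a frozen denoising process with gradients of a differentiable trajectory reward, can be sketched in a few lines. Everything below is a hypothetical illustration, not the paper's implementation: `frozen_denoiser` is a toy stand-in for a pretrained diffusion policy, and `reward` is a hand-written obstacle-avoidance reward where VLS would instead use a VLM-synthesized one.

```python
import torch

def frozen_denoiser(a_t: torch.Tensor, t: int) -> torch.Tensor:
    # Toy stand-in for a pretrained policy's denoising step: contracts the
    # noisy action trajectory toward a clean sample. Not a real policy.
    return a_t * 0.9

def reward(traj: torch.Tensor, obstacle=torch.tensor([0.5, 0.5])) -> torch.Tensor:
    # Hypothetical trajectory-differentiable reward: reward keeping the
    # trajectory's closest waypoint far from a test-time obstacle.
    dists = torch.linalg.norm(traj - obstacle, dim=-1)
    return dists.min()

def steered_denoise(steps: int = 50, horizon: int = 16,
                    guidance_scale: float = 0.1) -> torch.Tensor:
    # Start from noise: a horizon of 2-D waypoints.
    a_t = torch.randn(horizon, 2)
    for t in range(steps, 0, -1):
        a_t = frozen_denoiser(a_t, t)            # unmodified policy update
        a_t = a_t.detach().requires_grad_(True)  # enable reward gradients
        (grad,) = torch.autograd.grad(reward(a_t), a_t)
        a_t = (a_t + guidance_scale * grad).detach()  # nudge toward reward
    return a_t

traj = steered_denoise()
print(traj.shape)  # torch.Size([16, 2])
```

The key property is that the policy parameters are never touched: the gradient is taken with respect to the sampled trajectory itself and injected between denoising steps, which is what makes the adaptation training-free.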