Retrieve, Don't Retrain: Extending Vision Language Action Models to New Tasks at Test Time

Extending a vision-language-action (VLA) policy to a new task typically requires task-specific teleoperated demonstrations and per-task fine-tuning, making adaptation costly in both data collection and compute. In this paper, we show that this target-side per-task adaptation cost can be replaced by retrieval. Our retrieval-augmented policy is trained once on paired demonstrations from the target embodiment (query) and a cheaper embodiment (pool, e.g., human-hand video), then frozen. New tasks are added at deployment by appending pool-side demonstrations to a retrieval pool. The frozen policy conditions on retrieved trajectories at every control step, so new tasks are absorbed by indexing data rather than updating parameters. Fine-tuning is needed only to take on a new, unseen embodiment, not for each new task. We show that the benefit of retrieval is not tied to a specific backbone and extends to standard VLA policies, but its effect is especially pronounced in Cosmos Policy, a video-generation-based world-action model (WAM). In this setting, retrieval supplies coarse task progression, while the WAM's future-image objective provides an additional visual consistency signal that strengthens the retrieval-conditioned actions. On PushT, we study how retrieval provides a reusable high-level motion prior for cross-embodiment generalization to unseen goal angles. On RoboTwin 2.0, our method outperforms cross-embodiment baselines on unseen tasks. We additionally demonstrate the method on a real robot.
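
The deployment pattern described in the abstract (a frozen policy, a retrieval pool that grows as tasks are added, and retrieval at every control step) can be illustrated with a toy sketch. This is not the paper's implementation. The names Demo, RetrievalPool, and frozen_policy are hypothetical, the embeddings are hand-written 2-D vectors instead of learned features, cosine similarity is an assumed retrieval metric, and the "policy" is a fixed stand-in function rather than a trained VLA or WAM.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Demo:
    """One pool-side demonstration (hypothetical structure)."""
    task: str
    embedding: List[float]         # assumed feature vector used for retrieval
    trajectory: List[List[float]]  # assumed sequence of 2-D waypoints


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class RetrievalPool:
    """Append-only pool: adding a task means adding data, not training."""
    demos: List[Demo] = field(default_factory=list)

    def add(self, demo: Demo) -> None:
        self.demos.append(demo)

    def retrieve(self, query: List[float], k: int = 1) -> List[Demo]:
        # Rank all stored demos by similarity to the query embedding.
        ranked = sorted(
            self.demos,
            key=lambda d: cosine(query, d.embedding),
            reverse=True,
        )
        return ranked[:k]


def frozen_policy(observation: List[float], retrieved: List[Demo]) -> List[float]:
    """Stand-in for the frozen policy: it has no trainable state and simply
    moves toward the first waypoint of the top retrieved trajectory."""
    target = retrieved[0].trajectory[0]
    return [t - o for t, o in zip(target, observation)]


# Pool as it might look after the one-time training phase (toy values).
pool = RetrievalPool()
pool.add(Demo("push", [1.0, 0.0], [[0.5, 0.0]]))
pool.add(Demo("lift", [0.0, 1.0], [[0.0, 0.5]]))

# One control step: retrieve for the current query, then act.
observation = [0.0, 0.0]
first = pool.retrieve([0.9, 0.1], k=1)
action = frozen_policy(observation, first)

# A new task is added at deployment by appending a demo; the policy
# function above is left untouched.
pool.add(Demo("stack", [-1.0, 0.0], [[-0.5, 0.0]]))
second = pool.retrieve([-0.8, 0.1], k=1)
new_action = frozen_policy(observation, second)
```

In this sketch the only state that changes between the two control steps is the contents of the pool, which mirrors the claim that new tasks are absorbed by indexing data rather than by updating parameters. A real system would replace the hand-written embeddings with learned features and the stand-in function with the trained policy.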
