PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs

Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, Ishita Dasgupta, Annie Xie, Danny Driess, Ayzaan Wahid, Zhuo Xu, Quan Vuong, Tingnan Zhang, Tsang-Wei Edward Lee, Kuang-Huei Lee, Peng Xu, Sean Kirmani, Yuke Zhu, Andy Zeng, Karol Hausman, Nicolas Heess, Chelsea Finn, Sergey Levine, Brian Ichter

Vision language models (VLMs) have shown impressive capabilities across a variety of tasks, from logical reasoning to visual understanding. This opens the door to richer interaction with the world, for example robotic control. However, VLMs produce only textual outputs, while robotic control and other spatial tasks require continuous outputs such as coordinates, actions, or trajectories. How can we enable VLMs to handle such settings without fine-tuning on task-specific data? In this paper, we propose a novel visual prompting approach for VLMs that we call Prompting with Iterative Visual Optimization (PIVOT), which casts tasks as iterative visual question answering. In each iteration, the image is annotated with a visual representation of proposals that the VLM can refer to (e.g., candidate robot actions, localizations, or trajectories). The VLM then selects the best ones for the task. These proposals are iteratively refined, allowing the VLM to eventually zero in on the best available answer. We investigate PIVOT on real-world robotic navigation, real-world manipulation from images, instruction following in simulation, and additional spatial inference tasks such as localization. We find, perhaps surprisingly, that our approach enables zero-shot control of robotic systems without any robot training data, navigation in a variety of environments, and other capabilities. Although current performance is far from perfect, our work highlights the potential and limitations of this new regime and shows a promising approach for Internet-scale VLMs in robotic and spatial reasoning domains. Website: pivot-prompt.github.io. HuggingFace: https://huggingface.co/spaces/pivot-prompt/pivot-prompt-demo
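
The propose, select, refine loop described above can be illustrated with a minimal sketch. The abstract does not say how proposals are sampled or refined, so the code below assumes one plausible scheme: candidates are drawn from a Gaussian over normalized image coordinates, and the Gaussian is refit to the selected candidates after each round. The names `pivot_like_search` and `make_toy_selector` are hypothetical, and the selector is a stand-in that ranks candidates by distance to a hidden target; in the actual method a VLM would be shown the annotated image and asked to pick.

```python
import math
import random
import statistics
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
# A selector receives the candidate points and k, and returns the indices
# of the k candidates it prefers. In PIVOT this role is played by a VLM
# looking at an annotated image; here it is an abstract callback.
Selector = Callable[[Sequence[Point], int], List[int]]


def make_toy_selector(target: Point) -> Selector:
    """Hypothetical stand-in for the VLM: prefers points nearest `target`."""
    def select(proposals: Sequence[Point], k: int) -> List[int]:
        ranked = sorted(range(len(proposals)),
                        key=lambda i: math.dist(proposals[i], target))
        return ranked[:k]
    return select


def pivot_like_search(select_best: Selector,
                      iterations: int = 8,
                      num_proposals: int = 20,
                      num_selected: int = 3,
                      min_std: float = 0.05,
                      seed: int = 0) -> Point:
    """Iteratively sample, select, and refit (an assumed refinement rule)."""
    rng = random.Random(seed)
    mean = [0.5, 0.5]   # start at the image centre (normalized coordinates)
    std = [0.3, 0.3]    # broad initial spread
    for _ in range(iterations):
        # 1. Propose: sample candidates and clip them to the image bounds.
        proposals = [
            tuple(min(1.0, max(0.0, rng.gauss(m, s)))
                  for m, s in zip(mean, std))
            for _ in range(num_proposals)
        ]
        # 2. Select: ask the selector (the VLM in PIVOT) for the best ones.
        picked = [proposals[i] for i in select_best(proposals, num_selected)]
        # 3. Refine: refit the sampling distribution to the selected points.
        for d in range(2):
            values = [p[d] for p in picked]
            mean[d] = statistics.fmean(values)
            # Floor the spread so the search does not collapse too early.
            std[d] = max(statistics.pstdev(values), min_std)
    return (mean[0], mean[1])


if __name__ == "__main__":
    estimate = pivot_like_search(make_toy_selector((0.8, 0.2)))
    print("estimated point:", estimate)
```

Keeping the selector as a callback separates the optimization loop from the question of how a model is queried, so a real VLM call could replace the toy selector without touching the loop. The spread floor `min_std` is a design choice of this sketch, not something stated in the abstract.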
