Foundation models trained on internet-scale data, such as Vision-Language Models (VLMs), excel at tasks that require common sense, such as visual question answering. Despite their impressive capabilities, these models cannot currently be applied directly to challenging robot manipulation problems that require complex and precise continuous reasoning. Task and Motion Planning (TAMP) systems can control high-dimensional continuous systems over long horizons by combining traditional primitive robot operations. However, these systems require a detailed model of how the robot can impact its environment, preventing them from directly interpreting and addressing novel human objectives, such as an arbitrary natural-language goal. We propose deploying VLMs within TAMP systems by having them generate discrete and continuous language-parameterized constraints that enable TAMP to reason about open-world concepts. Specifically, we propose algorithms for VLM partial planning, which constrains a TAMP system's discrete temporal search, and for VLM continuous constraint interpretation, which augments the traditional manipulation constraints that TAMP systems seek to satisfy. We demonstrate our approach on two robot embodiments, including a real-world robot, across several manipulation tasks whose desired objectives are conveyed solely through language.
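To make the core idea concrete, here is a minimal, hypothetical sketch (not the authors' actual interface): a VLM proposes language-parameterized constraints, grounded as predicates over continuous object placements, and a TAMP-style sampler accepts only placements that satisfy every constraint alongside the usual geometric ones. All names and the 2D pose simplification are illustrative assumptions.

```python
# Hypothetical sketch of language-parameterized constraints for TAMP.
# Poses are simplified to 2D (x, y) placements; a real system would use
# full SE(3) poses and collision/kinematics constraints as well.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Pose = Tuple[float, float]  # simplified 2D placement (x, y)


@dataclass
class Constraint:
    """A constraint parameterized by a natural-language concept."""
    description: str  # e.g. "place the mug near the sink"
    check: Callable[[Dict[str, Pose]], bool]


def satisfies_all(state: Dict[str, Pose], constraints: List[Constraint]) -> bool:
    """A TAMP sampler would reject candidate placements violating any constraint."""
    return all(c.check(state) for c in constraints)


# Imagined VLM output, grounded here as a simple geometric predicate:
near_sink = Constraint(
    description="place the mug near the sink",
    check=lambda s: abs(s["mug"][0] - s["sink"][0]) < 0.3
    and abs(s["mug"][1] - s["sink"][1]) < 0.3,
)

good = {"mug": (1.0, 0.5), "sink": (1.1, 0.6)}
bad = {"mug": (2.0, 0.5), "sink": (1.1, 0.6)}
print(satisfies_all(good, [near_sink]))  # True
print(satisfies_all(bad, [near_sink]))   # False
```

In this framing, the VLM supplies the `check` predicate (a continuous constraint) and could likewise emit discrete ordering constraints on the plan skeleton, which the TAMP search then satisfies with its existing machinery.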