Although Model Predictive Control (MPC) can effectively predict the future states of a system and is therefore widely used in robotic manipulation tasks, it lacks environmental perception, leading to failures in some complex scenarios. To address this issue, we introduce Vision-Language Model Predictive Control (VLMPC), a robotic manipulation framework that leverages the strong perception capability of vision-language models (VLMs) and integrates it with MPC. Specifically, we propose a conditional action sampling module that takes a goal image or a language instruction as input and leverages a VLM to sample a set of candidate action sequences. A lightweight action-conditioned video prediction model is then designed to generate future frames conditioned on these candidate action sequences. VLMPC produces the optimal action sequence with the assistance of the VLM through a hierarchical cost function that formulates both pixel-level and knowledge-level consistency between the current observation and the goal image. We demonstrate that VLMPC outperforms state-of-the-art methods on public benchmarks. More importantly, our method achieves excellent performance on a variety of real-world robotic manipulation tasks. Code is available at~\url{https://github.com/PPjmchen/VLMPC}.
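For intuition, the receding-horizon decision loop described above can be sketched as minimal Python pseudocode. The function names (\texttt{vlm\_sample\_actions}, \texttt{predict\_frames}, \texttt{vlm\_goal\_score}) and the simple weighted combination of the two cost terms are illustrative assumptions, not the released implementation.
\begin{verbatim}
import numpy as np

# Illustrative sketch of a VLMPC-style control loop (hypothetical interfaces).
# vlm_sample_actions, predict_frames, and vlm_goal_score stand in for the
# VLM-conditioned action sampler, the action-conditioned video prediction
# model, and the knowledge-level cost, respectively.

def pixel_cost(predicted_frames, goal_image):
    """Pixel-level consistency: squared distance of the final frame to the goal."""
    return float(np.mean((predicted_frames[-1] - goal_image) ** 2))

def vlmpc_step(observation, goal_image, instruction,
               vlm_sample_actions, predict_frames, vlm_goal_score,
               num_candidates=16, weight=0.5):
    """Select the next action by scoring VLM-sampled candidate action sequences."""
    candidates = vlm_sample_actions(observation, goal_image,
                                    instruction, num_candidates)
    best_action, best_cost = None, float("inf")
    for actions in candidates:
        frames = predict_frames(observation, actions)   # predicted future frames
        cost = (weight * pixel_cost(frames, goal_image)               # pixel level
                + (1 - weight) * vlm_goal_score(frames, goal_image,   # knowledge level
                                                instruction))
        if cost < best_cost:
            best_action, best_cost = actions[0], cost
    return best_action  # receding horizon: execute only the first action
\end{verbatim}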