Although Model Predictive Control (MPC) can effectively predict the future states of a system and is therefore widely used in robotic manipulation tasks, it lacks environmental perception, leading to failures in some complex scenarios. To address this issue, we introduce Vision-Language Model Predictive Control (VLMPC), a robotic manipulation framework that leverages the strong perception capability of vision-language models (VLMs) and integrates it with MPC. Specifically, we propose a conditional action sampling module that takes a goal image or a language instruction as input and leverages a VLM to sample a set of candidate action sequences. A lightweight action-conditioned video prediction model is then designed to generate future frames conditioned on these candidate action sequences. VLMPC produces the optimal action sequence with the assistance of the VLM through a hierarchical cost function that formulates both pixel-level and knowledge-level consistency between the current observation and the goal image. We demonstrate that VLMPC outperforms state-of-the-art methods on public benchmarks. More importantly, our method achieves excellent performance on a variety of real-world robotic manipulation tasks. Code is available at~\url{https://github.com/PPjmchen/VLMPC}.
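For intuition, the receding-horizon decision loop described above can be sketched as minimal Python pseudocode. The function names (\texttt{vlm\_sample\_actions}, \texttt{predict\_frames}, \texttt{vlm\_goal\_score}) and the simple weighted combination of the two cost terms are illustrative assumptions, not the released implementation.
\begin{verbatim}
import numpy as np

# Illustrative sketch of a VLMPC-style control loop (hypothetical interfaces).
# vlm_sample_actions, predict_frames, and vlm_goal_score stand in for the
# VLM-conditioned action sampler, the action-conditioned video prediction
# model, and the knowledge-level cost, respectively.

def pixel_cost(predicted_frames, goal_image):
    """Pixel-level consistency: squared distance of the final frame to the goal."""
    return float(np.mean((predicted_frames[-1] - goal_image) ** 2))

def vlmpc_step(observation, goal_image, instruction,
               vlm_sample_actions, predict_frames, vlm_goal_score,
               num_candidates=16, weight=0.5):
    """Select the next action by scoring VLM-sampled candidate action sequences."""
    candidates = vlm_sample_actions(observation, goal_image,
                                    instruction, num_candidates)
    best_action, best_cost = None, float("inf")
    for actions in candidates:
        frames = predict_frames(observation, actions)   # predicted future frames
        cost = (weight * pixel_cost(frames, goal_image)               # pixel level
                + (1 - weight) * vlm_goal_score(frames, goal_image,   # knowledge level
                                                instruction))
        if cost < best_cost:
            best_action, best_cost = actions[0], cost
    return best_action  # receding horizon: execute only the first action
\end{verbatim}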