Self-Corrected Multimodal Large Language Model for End-to-End Robot Manipulation

Robot manipulation policies often perform poorly when confronted with novel tasks or object instances. The capability to automatically detect and self-correct failed actions is therefore essential for a practical robotic system. Recently, Multimodal Large Language Models (MLLMs) have shown promise in visual instruction following and demonstrated strong reasoning abilities across various tasks. To unleash general MLLMs as end-to-end robotic agents, we introduce a Self-Corrected MLLM (SC-MLLM), which equips the model not only to predict end-effector poses but also to autonomously recognize and correct failed actions. Specifically, we first conduct parameter-efficient fine-tuning to give the MLLM pose prediction ability, with pose prediction reframed as a language modeling problem. When facing execution failures, our model learns to identify the causes of low-level action errors (i.e., position and rotation errors) and adaptively seeks prompt feedback from experts. Based on the feedback, SC-MLLM rethinks the current failure scene and generates corrected actions. Furthermore, we design a continuous policy learning method for successfully corrected samples, enhancing the model's adaptability to the current scene configuration and reducing the frequency of expert intervention. To evaluate SC-MLLM, we conduct extensive experiments in both simulation and real-world settings. The SC-MLLM agent significantly improves manipulation accuracy compared to the previous state-of-the-art robotic MLLM (ManipLLM), increasing it from 57% to 79% on seen object categories and from 47% to 69% on unseen novel categories.
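
To make the idea of treating pose prediction as a language modeling problem concrete, the following is a minimal sketch of how an end-effector pose might be written as a text target and parsed back from generated text. The text format, the representation of rotation as a direction vector, and the names `pose_to_text` and `text_to_pose` are hypothetical and chosen only for illustration; the abstract does not specify the actual output format used by SC-MLLM.

```python
import re
from typing import List, Tuple

# Hypothetical text format; the source does not state the real one.
_NUM = r"-?\d+(?:\.\d+)?"
_PATTERN = re.compile(
    rf"position: \(({_NUM}), ({_NUM}), ({_NUM})\); "
    rf"direction: \(({_NUM}), ({_NUM}), ({_NUM})\)"
)


def pose_to_text(position: List[float], direction: List[float], decimals: int = 2) -> str:
    """Serialize a pose into a string a language model could be trained to emit."""
    pos = ", ".join(f"{v:.{decimals}f}" for v in position)
    rot = ", ".join(f"{v:.{decimals}f}" for v in direction)
    return f"position: ({pos}); direction: ({rot})"


def text_to_pose(text: str) -> Tuple[List[float], List[float]]:
    """Parse a generated string back into numeric position and direction."""
    match = _PATTERN.search(text)
    if match is None:
        # A malformed generation is one possible failure a controller must handle.
        raise ValueError(f"no pose found in model output: {text!r}")
    values = [float(g) for g in match.groups()]
    return values[:3], values[3:]


if __name__ == "__main__":
    target = pose_to_text([0.1, -0.25, 0.5], [0.0, 0.0, 1.0])
    print(target)
    print(text_to_pose(target))
```

Under this assumed format, the fine-tuning target is an ordinary token sequence, so the standard next-token loss applies without a separate regression head. Rounding to a fixed number of decimals bounds the output length but also limits pose resolution.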

