Robot manipulation policies often perform poorly when confronted with novel tasks or object instances. Hence, the capability to automatically detect and self-correct failed actions is essential for a practical robotic system. Recently, Multimodal Large Language Models (MLLMs) have shown promise in visual instruction following and demonstrated strong reasoning abilities in various tasks. To unleash general MLLMs as end-to-end robotic agents, we introduce a Self-Corrected (SC)-MLLM, equipping our model not only to predict end-effector poses but also to autonomously recognize and correct failed actions. Specifically, we first conduct parameter-efficient fine-tuning to give the MLLM pose prediction ability, reframing pose prediction as a language modeling problem. When facing execution failures, our model learns to identify the causes of low-level action errors (i.e., position and rotation errors) and adaptively seeks prompt feedback from experts. Based on the feedback, SC-MLLM rethinks the current failure scene and generates corrected actions. Furthermore, we design a continuous policy learning method for successfully corrected samples, enhancing the model's adaptability to the current scene configuration and reducing the frequency of expert intervention. To evaluate SC-MLLM, we conduct extensive experiments in both simulation and real-world settings. The SC-MLLM agent significantly improves manipulation accuracy compared to the previous state-of-the-art robotic MLLM (ManipLLM), increasing it from 57% to 79% on seen object categories and from 47% to 69% on unseen novel categories.
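To make the idea of treating pose prediction as language modeling concrete, the following is a minimal sketch of how an end-effector pose could be written as text and parsed back from a model's output. The text template, the use of a 3D position plus a 3D direction vector, and the function names `pose_to_text` and `text_to_pose` are illustrative assumptions, not the format SC-MLLM actually uses.

```python
import re

# Hypothetical text template for a pose; the real SC-MLLM format is not
# specified in the abstract, so this layout is an assumption.
_NUM = r"-?\d+(?:\.\d+)?"
_PATTERN = re.compile(
    rf"position: \(({_NUM}), ({_NUM}), ({_NUM})\); "
    rf"direction: \(({_NUM}), ({_NUM}), ({_NUM})\)"
)


def pose_to_text(position, direction, decimals=2):
    """Serialize a pose as plain text, usable as a language-modeling target."""
    pos = ", ".join(f"{v:.{decimals}f}" for v in position)
    rot = ", ".join(f"{v:.{decimals}f}" for v in direction)
    return f"position: ({pos}); direction: ({rot})"


def text_to_pose(text):
    """Parse generated text back into (position, direction), or None if malformed."""
    match = _PATTERN.search(text)
    if match is None:
        return None
    values = [float(g) for g in match.groups()]
    return tuple(values[:3]), tuple(values[3:])
```

Under this kind of encoding, a malformed generation is returned as `None` rather than raising an error, so a caller could treat it like any other failed action.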