AIC MLLM: Autonomous Interactive Correction MLLM for Robust Robotic Manipulation

The ability to reflect on and correct failures is crucial for robotic systems to interact stably with real-life objects. Observing the generalization and reasoning capabilities of Multimodal Large Language Models (MLLMs), previous approaches have aimed to utilize these models to enhance robotic systems accordingly. However, these methods typically focus on high-level planning corrections using an additional MLLM, with limited utilization of failed samples to correct low-level contact poses. To address this gap, we propose an Autonomous Interactive Correction (AIC) MLLM, which makes use of previous low-level interaction experiences to correct SE(3) pose predictions. Specifically, AIC MLLM is initially fine-tuned to acquire both pose prediction and feedback prompt comprehension abilities. We carefully design two types of prompt instructions through interactions with objects: 1) visual masks to highlight unmovable parts for position correction, and 2) textual descriptions to indicate potential directions for rotation correction. During inference, a Feedback Information Extraction module is introduced to recognize the failure cause, allowing AIC MLLM to adaptively correct the pose prediction using the corresponding prompts. To further enhance manipulation stability, we devise a Test Time Adaptation strategy that enables AIC MLLM to better adapt to the current scene configuration. Finally, extensive experiments are conducted in both simulated and real-world environments to evaluate the proposed method. The results demonstrate that our AIC MLLM can efficiently correct failure samples by leveraging interaction experience prompts. Real-world demonstration can be found at https://sites.google.com/view/aic-mllm
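The inference-time loop described above (predict a pose, attempt the interaction, identify the failure cause, and re-predict with the matching prompt) can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`Pose`, `Feedback`, `FailureCause`, `correct_until_success`, and the `toy_*` stand-ins) is hypothetical, the pose is reduced to a 2D contact point plus a direction label rather than a full SE(3) pose, the MLLM and the Feedback Information Extraction module are replaced by toy functions, and Test Time Adaptation is not modelled.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, List, Optional, Tuple


class FailureCause(Enum):
    POSITION = auto()  # contact point lies on an unmovable part
    ROTATION = auto()  # end-effector direction is wrong


@dataclass
class Pose:
    # Simplified stand-in for an SE(3) pose (assumption for illustration).
    position: Tuple[int, int]
    direction: str


@dataclass
class Feedback:
    cause: FailureCause
    prompt: str  # stands in for a visual mask or a textual hint


def correct_until_success(
    predict: Callable[[List[Feedback]], Pose],
    execute: Callable[[Pose], bool],
    extract_feedback: Callable[[Pose], Feedback],
    max_attempts: int = 5,
) -> Tuple[Optional[Pose], int]:
    """Re-predict with accumulated feedback until the interaction succeeds."""
    history: List[Feedback] = []
    for attempt in range(1, max_attempts + 1):
        pose = predict(history)
        if execute(pose):
            return pose, attempt
        # Failure: record the diagnosed cause so the next prediction can use it.
        history.append(extract_feedback(pose))
    return None, max_attempts


# Toy stand-ins for the model, the environment, and the feedback extractor.
def toy_predict(history: List[Feedback]) -> Pose:
    position, direction = (0, 0), "up"
    for fb in history:
        if fb.cause is FailureCause.POSITION:
            position = (1, 0)      # move off the masked, unmovable region
        elif fb.cause is FailureCause.ROTATION:
            direction = "forward"  # follow the textual direction hint
    return Pose(position, direction)


def toy_execute(pose: Pose) -> bool:
    return pose.position == (1, 0) and pose.direction == "forward"


def toy_extract(pose: Pose) -> Feedback:
    if pose.position != (1, 0):
        return Feedback(FailureCause.POSITION, "mask: unmovable region at (0, 0)")
    return Feedback(FailureCause.ROTATION, "text: try the forward direction")


final_pose, attempts = correct_until_success(toy_predict, toy_execute, toy_extract)
```

In this toy setup the first attempt fails on position, the second on rotation, and the third succeeds, so each kind of prompt is used once. Keeping the feedback history as a list is a design choice of the sketch, meant to mirror the idea of reusing previous interaction experiences.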
