Robot manipulation relies on accurately predicting contact points and end-effector directions to ensure successful operation. However, learning-based robot manipulation, trained on a limited set of object categories in a simulator, often struggles to generalize, especially when confronted with a wide range of categories. We therefore introduce an approach to robot manipulation that leverages the robust reasoning capabilities of Multimodal Large Language Models (MLLMs) to enhance the stability and generalization of manipulation. By fine-tuning the injected adapters, we preserve the inherent common sense and reasoning ability of the MLLM while equipping it with the ability to manipulate objects. The key insight lies in the introduced fine-tuning paradigm, which encompasses object category understanding, affordance prior reasoning, and object-centric pose prediction to stimulate the reasoning ability of the MLLM in manipulation. During inference, our approach takes an RGB image and a text prompt and predicts the end-effector's pose in a chain-of-thought manner. After the initial contact is established, an active impedance adaptation policy plans the upcoming waypoints in a closed-loop manner. Moreover, for the real world, we design a test-time adaptation (TTA) strategy for manipulation that enables the model to better adapt to the current real-world scene configuration. Experiments in simulation and in the real world show the promising performance of our approach, ManipLLM. More details and demonstrations can be found at https://sites.google.com/view/manipllm.