Recent advances in Large Language Models (LLMs) have showcased their remarkable reasoning capabilities, making them influential across various fields. In robotics, however, their use has been limited primarily to manipulation planning tasks because their output is inherently textual. This paper addresses this limitation by investigating whether the reasoning ability of LLMs can be adapted to generate numerical predictions in robotics tasks, specifically robotic grasping. We propose Reasoning Tuning, a novel method that adds a reasoning phase before prediction during training, leveraging the extensive prior knowledge and advanced reasoning abilities of LLMs. This approach enables LLMs, in particular those with multi-modal capabilities, to generate accurate numerical outputs, such as grasp poses, that are context-aware and can be adjusted through conversation. Additionally, we present the Reasoning Tuning VLM Grasp dataset, carefully curated to facilitate the adaptation of LLMs to robotic grasping. Extensive validation on both grasping datasets and real-world experiments underscores the adaptability of multi-modal LLMs to numerical prediction tasks in robotics. These results not only expand the applicability of LLMs but also bridge the gap between text-based planning and direct robot control, thereby maximizing the potential of LLMs in robotics.