Large language models (LLMs) have expanded significantly and are increasingly integrated across various domains. In robot task planning in particular, LLMs draw on their reasoning and language comprehension capabilities to formulate precise and efficient action plans from natural language instructions. However, for embodied tasks, in which robots interact with complex environments, text-only LLMs often struggle because they are not compatible with robotic visual perception. This study provides a comprehensive overview of the emerging integration of LLMs and multimodal LLMs into various robotic tasks. Additionally, we propose a framework that uses the multimodal GPT-4V to enhance embodied task planning by combining natural language instructions with robot visual perception. Our results, based on diverse datasets, indicate that GPT-4V effectively enhances robot performance in embodied tasks. This survey and evaluation of LLMs and multimodal LLMs across a variety of robotic tasks enriches the understanding of LLM-centric embodied intelligence and offers forward-looking insights toward bridging the gap in Human-Robot-Environment interaction.