Large Language Models (LLMs) have demonstrated potential in Vision-and-Language Navigation (VLN) tasks, yet current applications face challenges. While LLMs excel at general conversation, they underperform specialized VLN models on navigation tasks. We introduce FLAME (FLAMingo-Architected Embodied Agent), a novel Multimodal LLM-based agent and architecture designed for urban VLN tasks that efficiently handles multiple observations. Our approach uses a three-phase tuning technique for effective adaptation to navigation: single perception tuning for street view description, multiple perception tuning for trajectory summarization, and end-to-end training on VLN datasets. The augmented datasets for these phases are synthesized automatically. Experimental results demonstrate FLAME's superiority over existing methods, exceeding the state of the art by 7.3% in task completion rate on the Touchdown dataset. This work showcases the potential of Multimodal LLMs (MLLMs) in complex navigation tasks, marking an advance toward practical applications of MLLMs in embodied AI. Project page: https://flame-sjtu.github.io
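To make the three-phase schedule concrete, below is a minimal, hypothetical sketch of how such a curriculum could be organized. It is not the authors' implementation: the toy model, dataset shapes, and hyperparameters are all illustrative placeholders; only the phase ordering (single perception, multiple perception, then end-to-end VLN training) follows the abstract.

```python
# Hypothetical sketch of a three-phase tuning curriculum like FLAME's.
# All names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class ToyMultimodalAgent(nn.Module):
    """Stand-in for a Flamingo-style MLLM: projects visual features
    into the text space, then decodes over a toy vocabulary."""
    def __init__(self, vis_dim=64, txt_dim=64, vocab=100):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, txt_dim)  # cross-modal projection
        self.decoder = nn.Linear(txt_dim, vocab)     # toy language/action head

    def forward(self, vis_feats):
        # vis_feats: (batch, n_obs, vis_dim); pool across observations
        fused = self.vis_proj(vis_feats).mean(dim=1)
        return self.decoder(fused)                   # (batch, vocab) logits

def run_phase(model, loader, epochs, lr, name):
    """One tuning phase: same objective shape, different data distribution."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for vis, target in loader:
            opt.zero_grad()
            loss = loss_fn(model(vis), target)
            loss.backward()
            opt.step()
    print(f"{name} phase done, last loss {loss.item():.3f}")

def toy_loader(n_obs):
    # Random stand-ins for synthesized data: n_obs=1 mimics single
    # street views, n_obs>1 mimics multi-observation trajectories.
    vis = torch.randn(32, n_obs, 64)
    target = torch.randint(0, 100, (32,))
    return DataLoader(TensorDataset(vis, target), batch_size=8)

model = ToyMultimodalAgent()
# Phase 1: single perception tuning (one street view -> description)
run_phase(model, toy_loader(n_obs=1), epochs=1, lr=1e-3, name="single-perception")
# Phase 2: multiple perception tuning (observation sequence -> trajectory summary)
run_phase(model, toy_loader(n_obs=8), epochs=1, lr=1e-3, name="multi-perception")
# Phase 3: end-to-end training on VLN data (targets stand in for actions)
run_phase(model, toy_loader(n_obs=8), epochs=1, lr=5e-4, name="end-to-end VLN")
```

The point of the staged curriculum, as described in the abstract, is that the agent first learns to describe single observations, then to aggregate many observations, before facing the full navigation objective; the sketch encodes that ordering as three successive calls with progressively navigation-like data.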