Autonomous navigation in open-world outdoor environments must jointly handle dynamic conditions, long-distance spatial reasoning, and semantic understanding. Traditional methods struggle to balance local planning, global planning, and semantic task execution; large language models (LLMs) improve semantic comprehension but lack spatial reasoning, and diffusion models excel at local trajectory optimization but fall short in large-scale, long-distance navigation. To address these gaps, this paper proposes KiteRunner, a language-driven cooperative local-global navigation strategy that combines UAV orthophoto-based global planning with diffusion-model-driven local path generation for long-distance navigation in open-world scenarios. Our method leverages real-time UAV orthophotography to construct a global probability map that provides traversability guidance to the local planner, while integrating vision-language and large language models such as CLIP and GPT to interpret natural language instructions. Experiments show that KiteRunner improves path efficiency over state-of-the-art methods by 5.6% in structured environments and 12.8% in unstructured environments, while significantly reducing human interventions and execution time.