Recent advances in large Vision-Language Models (VLMs) have provided rich semantic understanding that empowers drones to search for open-set objects via natural language instructions. However, prior systems struggle to integrate VLMs into practical aerial systems because of an orders-of-magnitude frequency mismatch between VLM inference and real-time planning, as well as VLMs' limited 3D scene understanding. They also lack a unified mechanism to balance semantic guidance with motion efficiency in large-scale environments. To address these challenges, we present AirHunt, an aerial object navigation system that efficiently locates open-set objects with zero-shot generalization in outdoor environments by fusing VLM semantic reasoning with continuous path planning. AirHunt features a dual-pathway asynchronous architecture that establishes a synergistic interface between VLM reasoning and path planning, enabling continuous flight with adaptive semantic guidance that evolves through motion. Moreover, we propose an active dual-task reasoning module that exploits geometric and semantic redundancy to query the VLM selectively, and a semantic-geometric coherent planning module that dynamically reconciles semantic priorities and motion efficiency in a unified framework, allowing the system to adapt to environmental heterogeneity. We evaluate AirHunt across diverse object navigation tasks and environments, where it achieves a higher success rate, lower navigation error, and shorter flight time than state-of-the-art methods. Real-world experiments further validate AirHunt's practical capability in complex and challenging environments. The code and dataset will be made publicly available before publication.