IndoorUAV: Benchmarking Vision-Language UAV Navigation in Continuous Indoor Environments

Vision-Language Navigation (VLN) enables agents to navigate complex environments by following natural language instructions grounded in visual observations. Most existing work has focused on ground-based robots or outdoor Unmanned Aerial Vehicles (UAVs); indoor UAV-based VLN remains underexplored, despite its relevance to real-world applications such as inspection, delivery, and search-and-rescue in confined spaces. To bridge this gap, we introduce \textbf{IndoorUAV}, a novel benchmark and method specifically tailored for VLN with indoor UAVs. We begin by curating over 1,000 diverse and structurally rich 3D indoor scenes from the Habitat simulator. Within these environments, we simulate realistic UAV flight dynamics and manually collect 3D navigation trajectories, which we further enrich through data augmentation. We then design an automated annotation pipeline that generates natural language instructions of varying granularity for each trajectory. This process yields over 16,000 high-quality trajectories, comprising the \textbf{IndoorUAV-VLN} subset, which focuses on long-horizon VLN. To support short-horizon planning, we segment long trajectories into sub-trajectories by selecting semantically salient keyframes and regenerating concise instructions, forming the \textbf{IndoorUAV-VLA} subset. Finally, we introduce \textbf{IndoorUAV-Agent}, a novel navigation model designed for our benchmark that leverages task decomposition and multimodal reasoning. We hope IndoorUAV serves as a valuable resource to advance research on vision-language embodied AI in the indoor aerial navigation domain.
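
The segmentation step that forms the IndoorUAV-VLA subset can be illustrated with a minimal sketch. The abstract does not specify how keyframes are chosen or how segments are bounded, so everything here is an assumption for illustration: the `Pose` layout `(x, y, z, yaw)`, the helper name `split_at_keyframes`, the choice to take keyframe indices as given input, and the choice to let adjacent segments share their boundary pose. It is not the authors' implementation.

```python
from typing import List, Sequence, Tuple

# Hypothetical pose layout, assumed for illustration only: (x, y, z, yaw).
Pose = Tuple[float, float, float, float]


def split_at_keyframes(
    trajectory: Sequence[Pose], keyframes: Sequence[int]
) -> List[List[Pose]]:
    """Split a trajectory into sub-trajectories at the given keyframe indices.

    Adjacent segments share their boundary pose, so each sub-trajectory
    starts exactly where the previous one ended (an assumed convention).
    """
    if len(trajectory) < 2:
        return []
    last = len(trajectory) - 1
    # Keep interior keyframes only, sorted and de-duplicated,
    # then add the first and last index as segment boundaries.
    interior = sorted({k for k in keyframes if 0 < k < last})
    cuts = [0] + interior + [last]
    return [list(trajectory[a:b + 1]) for a, b in zip(cuts, cuts[1:])]


# Toy 7-pose straight-line path at constant altitude.
path = [(float(i), 0.0, 1.5, 0.0) for i in range(7)]
segments = split_at_keyframes(path, keyframes=[2, 5])
print([len(s) for s in segments])  # [3, 4, 2]
```

In a real pipeline the keyframe indices would come from whatever semantic-saliency criterion is used, and each resulting segment would then receive its own regenerated, concise instruction.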
