This paper introduces VLN-Pilot, a novel framework in which a Large Vision-Language Model (VLLM) assumes the role of a human pilot for indoor drone navigation. By leveraging the multimodal reasoning abilities of VLLMs, VLN-Pilot interprets free-form natural language instructions and grounds them in visual observations to plan and execute drone trajectories in GPS-denied indoor environments. Unlike traditional rule-based or geometric path-planning approaches, our framework integrates language-driven semantic understanding with visual perception, enabling context-aware, high-level flight behaviors with minimal task-specific engineering. VLN-Pilot supports fully autonomous instruction-following for drones by reasoning about spatial relationships, avoiding obstacles, and reacting dynamically to unforeseen events. We validate our framework on a custom photorealistic indoor simulation benchmark and demonstrate that the VLLM-driven agent achieves high success rates on complex instruction-following tasks, including long-horizon navigation with multiple semantic targets. Experimental results highlight the promise of replacing remote drone pilots with a language-guided autonomous agent, opening avenues for scalable, human-friendly control of indoor UAVs in tasks such as inspection, search-and-rescue, and facility monitoring. Our results suggest that VLLM-based pilots may substantially reduce operator workload while improving safety and mission flexibility in constrained indoor environments.