Recently emerged Vision-and-Language Navigation (VLN) tasks have drawn significant attention in both the computer vision and natural language processing communities. Existing VLN tasks are built for agents that navigate on the ground, either indoors or outdoors. However, many tasks require intelligent agents to operate in the sky, such as UAV-based goods delivery, traffic/security patrol, and scenery tours, to name a few. Navigating in the sky is more complicated than navigating on the ground because agents must account for flying height and reason about more complex spatial relationships. To fill this gap and facilitate research in this field, we propose a new task named AerialVLN, which is UAV-based and targets outdoor environments. We develop a 3D simulator rendered with near-realistic imagery of 25 city-level scenarios. Our simulator supports continuous navigation, environment extension, and configuration. We also propose an extended baseline model based on the widely used cross-modal alignment (CMA) navigation methods. We find that there is still a significant gap between the baseline model and human performance, which suggests that AerialVLN is a challenging new task. The dataset and code are available at https://github.com/AirVLN/AirVLN.