The ability to converse with humans and follow natural language commands is crucial for intelligent unmanned aerial vehicles (a.k.a. drones). It relieves people of the burden of holding a controller at all times, allows multitasking, and makes drone control more accessible to people with disabilities or with their hands occupied. To this end, we introduce Aerial Vision-and-Dialog Navigation (AVDN), the task of navigating a drone via natural language conversation. We build a drone simulator with a continuous photorealistic environment and collect a new AVDN dataset of over 3k recorded navigation trajectories with asynchronous human-human dialogs between commanders and followers. The commander provides the initial navigation instruction and further guidance upon request, while the follower navigates the drone in the simulator and asks questions when needed. During data collection, the followers' attention on the drone's visual observations is also recorded. Based on the AVDN dataset, we study the tasks of aerial navigation from (full) dialog history and propose an effective Human Attention Aided Transformer model (HAA-Transformer), which learns to predict both navigation waypoints and human attention.