This report describes the winning entry of the AVDN Challenge at ICCV 2023. The competition addresses the Aerial Navigation from Dialog History (ANDH) task, in which a drone agent must associate dialog history with aerial observations to reach a destination. To improve the agent's cross-modal grounding ability, we propose a Target-Grounded Graph-Aware Transformer (TG-GAT) framework. Concretely, TG-GAT first uses a graph-aware transformer to capture spatiotemporal dependencies, which benefits navigation state tracking and robust action planning. In addition, an auxiliary visual grounding task is introduced to strengthen the agent's awareness of referred landmarks. Moreover, a hybrid augmentation strategy based on large language models is used to mitigate data scarcity. Our TG-GAT framework won the AVDN Challenge 2023, with absolute improvements of 2.2% and 3.0% over the baseline on the SPL and SR metrics, respectively. The code is available at https://github.com/yifeisu/avdn-challenge.
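For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how SR (success rate) and SPL (success weighted by path length) are commonly computed in embodied navigation work. It assumes the standard SPL definition; the challenge's exact success criterion is not stated in this abstract, so per-episode success flags are taken as given inputs. The function names and the zero-length guard are illustrative assumptions, not part of the TG-GAT code.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes that ended in success."""
    return sum(successes) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by path length, averaged over all episodes.

    Each successful episode contributes l / max(p, l), where l is the
    shortest-path length and p is the length of the path actually taken.
    Failed episodes contribute 0.
    """
    total = 0.0
    for ok, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if not ok:
            continue
        denom = max(taken, shortest)
        # Assumption: a zero-length successful episode counts as fully efficient.
        total += shortest / denom if denom > 0 else 1.0
    return total / len(successes)


# Three hypothetical episodes: optimal success, failure, success with a detour.
flags = [True, False, True]
shortest = [10.0, 5.0, 8.0]
taken = [10.0, 7.0, 16.0]
print(success_rate(flags))
print(spl(flags, shortest, taken))
```

Under these definitions, an "absolute improvement" of 2.2% on SPL means the score rose by 2.2 percentage points over the baseline, not by 2.2% of the baseline's value.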