This report details the methods of the winning entry of the AVDN Challenge at ICCV CLVL 2023. The competition addresses the Aerial Navigation from Dialog History (ANDH) task, which requires a drone agent to associate dialog history with aerial observations in order to reach the destination. To improve the drone agent's cross-modal grounding, we propose a Target-Grounded Graph-Aware Transformer (TG-GAT) framework. Concretely, TG-GAT first leverages a graph-aware transformer to capture spatiotemporal dependencies, which benefits navigation state tracking and robust action planning. In addition, an auxiliary visual grounding task is devised to boost the agent's awareness of referred landmarks. Moreover, a hybrid augmentation strategy based on large language models is used to mitigate data scarcity. Our TG-GAT framework won the AVDN Challenge, with absolute improvements of 2.2% and 3.0% over the baseline on the SPL and SR metrics, respectively. The code is available at https://github.com/yifeisu/TG-GAT.
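For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how SR (success rate) and SPL (success weighted by path length) are commonly computed in navigation benchmarks. It assumes the widely used definition in which each successful episode contributes the ratio of the shortest path length to the longer of the taken and shortest path lengths. How the challenge decides whether an episode counts as a success is not stated here, so the per-episode success flag is taken as given. The function names are illustrative and are not taken from the TG-GAT repository.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes flagged as successful (SR)."""
    if not successes:
        raise ValueError("need at least one episode")
    return sum(1 for s in successes if s) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    taken_lengths: Sequence[float],
) -> float:
    """Success weighted by path length, under the common definition.

    Each successful episode contributes shortest / max(taken, shortest);
    failed episodes contribute 0. The result is averaged over all episodes.
    """
    if not (len(successes) == len(shortest_lengths) == len(taken_lengths)):
        raise ValueError("all inputs must have the same length")
    if not successes:
        raise ValueError("need at least one episode")

    total = 0.0
    for ok, shortest, taken in zip(successes, shortest_lengths, taken_lengths):
        if not ok:
            continue
        longest = max(taken, shortest)
        # Assumption: a zero-length episode that succeeds counts as fully efficient.
        total += shortest / longest if longest > 0 else 1.0
    return total / len(successes)


if __name__ == "__main__":
    # Hypothetical episodes, for illustration only (not challenge data).
    flags = [True, False, True]
    shortest = [10.0, 10.0, 10.0]
    taken = [10.0, 20.0, 20.0]
    print(f"SR  = {success_rate(flags):.3f}")
    print(f"SPL = {spl(flags, shortest, taken):.3f}")
```

Because SPL scales each success by path efficiency, it can never exceed SR, which is why the two metrics are usually reported together.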