Visual question answering (VQA) methods in remote sensing (RS) aim to answer natural language questions about an RS image. Most existing methods require substantial computational resources, which limits their use in operational RS scenarios. To address this issue, we present an effective lightweight transformer-based VQA in RS (LiT-4-RSVQA) architecture for efficient and accurate VQA in RS. Our architecture consists of: i) a lightweight text encoder module; ii) a lightweight image encoder module; iii) a fusion module; and iv) a classification module. Experimental results on a VQA benchmark dataset demonstrate that the proposed LiT-4-RSVQA architecture provides accurate VQA results while significantly reducing the computational requirements on the executing hardware. Our code is publicly available at https://git.tu-berlin.de/rsim/lit4rsvqa.
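To make the four-module data flow concrete, the following is a minimal, untrained sketch in plain Python. It is not the LiT-4-RSVQA implementation: the hashed bag-of-words text encoder, the band-averaging image encoder, fusion by concatenation, and the randomly initialized linear classifier are all simplified stand-ins assumed here for illustration, and the names `DIM`, `ANSWERS`, `encode_text`, `encode_image`, `fuse`, `make_classifier`, `classify`, and `answer` are hypothetical. The actual architecture uses lightweight transformer-based encoders, for which the linked repository is the reference.

```python
import random
import zlib

DIM = 8  # hypothetical size of each modality's feature vector
ANSWERS = ["yes", "no", "water", "forest"]  # hypothetical answer vocabulary


def encode_text(question, dim=DIM):
    """Stand-in for module i): hashed bag-of-words, normalized by token count."""
    vec = [0.0] * dim
    tokens = question.lower().split()
    for tok in tokens:
        # crc32 gives a hash that is stable across Python runs.
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    n = max(len(tokens), 1)
    return [v / n for v in vec]


def encode_image(image, dim=DIM):
    """Stand-in for module ii): mean pixel value of `dim` horizontal bands."""
    rows = len(image)
    feats = []
    for i in range(dim):
        lo = i * rows // dim
        hi = max((i + 1) * rows // dim, lo + 1)  # at least one row per band
        band = [px for row in image[lo:hi] for px in row]
        feats.append(sum(band) / len(band))
    return feats


def fuse(text_feat, image_feat):
    """Stand-in for module iii): fusion by concatenation (an assumption)."""
    return text_feat + image_feat


def make_classifier(in_dim, n_classes, seed=0):
    """Randomly initialized (untrained) weights for a linear classifier."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
            for _ in range(n_classes)]


def classify(fused, weights):
    """Stand-in for module iv): one logit per candidate answer."""
    return [sum(w * x for w, x in zip(row, fused)) for row in weights]


def answer(question, image, weights):
    """Run the full pipeline: encode both inputs, fuse, classify, pick argmax."""
    fused = fuse(encode_text(question), encode_image(image))
    logits = classify(fused, weights)
    return ANSWERS[logits.index(max(logits))]


if __name__ == "__main__":
    toy_image = [[0.5] * 16 for _ in range(16)]  # 16x16 single-band toy image
    weights = make_classifier(2 * DIM, len(ANSWERS))
    print(answer("Is there water in the image?", toy_image, weights))
```

Because the classifier is untrained, the printed answer is arbitrary; the sketch only shows how a question and an image pass through the two encoders, the fusion step, and the classification step, which treats VQA as selecting one answer from a fixed vocabulary.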