Visual question answering (VQA) methods in remote sensing (RS) aim to answer natural language questions about an RS image. Most existing methods require substantial computational resources, which limits their use in operational RS scenarios. To address this issue, we present LiT-4-RSVQA, a lightweight transformer-based architecture for efficient and accurate VQA in RS. The architecture consists of: i) a lightweight text encoder module; ii) a lightweight image encoder module; iii) a fusion module; and iv) a classification module. Experimental results obtained on a VQA benchmark dataset demonstrate that LiT-4-RSVQA provides accurate VQA results while significantly reducing the computational requirements on the executing hardware. Our code is publicly available at https://git.tu-berlin.de/rsim/lit4rsvqa.
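The four-module layout described in the abstract (text encoder, image encoder, fusion, classification) can be illustrated with a toy data-flow sketch. This is not the LiT-4-RSVQA implementation: the abstract does not specify the encoders, the fusion operator, or the answer vocabulary, so everything below is an assumption made for illustration. The encoders are simple stand-ins rather than transformers, the element-wise product fusion is one common choice and not necessarily the authors', the weights are random rather than trained, and all names (`encode_text`, `encode_image`, `fuse`, `classify`, `answer`, `DIM`) are hypothetical.

```python
import math
import random

DIM = 8  # shared feature width (hypothetical value)


def make_matrix(rows, cols, rng):
    """Random weight matrix standing in for trained parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def encode_text(question, dim=DIM):
    """Stand-in text encoder: mean of deterministic per-token pseudo-embeddings."""
    tokens = question.lower().split()
    feats = [0.0] * dim
    for tok in tokens:
        tok_rng = random.Random(tok)  # same token always yields the same vector
        for i in range(dim):
            feats[i] += tok_rng.uniform(-1.0, 1.0)
    return [f / max(len(tokens), 1) for f in feats]


def encode_image(image, patch=2, dim=DIM):
    """Stand-in image encoder: mean intensity per patch, padded or cut to dim."""
    height, width = len(image), len(image[0])
    feats = []
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            block = [image[i][j]
                     for i in range(r, min(r + patch, height))
                     for j in range(c, min(c + patch, width))]
            feats.append(sum(block) / len(block))
    feats = feats[:dim]
    return feats + [0.0] * (dim - len(feats))


def fuse(text_feats, image_feats, w_text, w_image):
    """Project both modalities, then combine by element-wise product (assumed)."""
    t = matvec(w_text, text_feats)
    v = matvec(w_image, image_feats)
    return [math.tanh(a) * math.tanh(b) for a, b in zip(t, v)]


def classify(fused, w_out):
    """Linear layer followed by a softmax over the answer vocabulary."""
    logits = matvec(w_out, fused)
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def answer(question, image, answers, seed=0):
    """Run the four stages in order and return (best answer, probabilities)."""
    rng = random.Random(seed)
    w_text = make_matrix(DIM, DIM, rng)
    w_image = make_matrix(DIM, DIM, rng)
    w_out = make_matrix(len(answers), DIM, rng)
    fused = fuse(encode_text(question), encode_image(image), w_text, w_image)
    probs = classify(fused, w_out)
    best = max(range(len(answers)), key=probs.__getitem__)
    return answers[best], probs


# Toy 4x4 single-band "image" and a hypothetical answer vocabulary.
image = [[(r * 4 + c) / 16 for c in range(4)] for r in range(4)]
answers = ["yes", "no", "forest", "urban"]
label, probs = answer("Is there water in the image?", image, answers)
print(label, [round(p, 3) for p in probs])
```

Because the weights are untrained, the predicted label carries no meaning; the sketch only shows how a question and an image pass through two separate encoders, are merged into one joint representation, and are mapped to a distribution over a fixed set of answers.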