This work addresses visual question answering (VQA) in remote sensing. Remotely sensed images carry information that is valuable for identification and object detection, but their high dimensionality, volume, and redundancy make them challenging to process. Processing image information jointly with language features adds further constraints, such as the need to map corresponding image and language features to each other. To address this, we propose a cross-attention-based approach combined with information maximization. The CNN-LSTM-based cross-attention highlights the relevant information in the image and language modalities and establishes a connection between the two, while information maximization learns a low-dimensional bottleneck layer that retains all the information required to carry out the VQA task. We evaluate our method on two remote sensing VQA datasets of different resolutions. On the high-resolution dataset, we achieve overall accuracies of 79.11% and 73.87% on the two test sets; on the low-resolution dataset, we achieve an overall accuracy of 85.98%.
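To make the cross-attention idea concrete, the following is a minimal sketch and not the implementation evaluated here. It assumes that image region features (standing in for CNN outputs) and question word features (standing in for LSTM outputs) already share a common dimension, and it uses scaled dot-product attention as the affinity function. Both choices are illustrative assumptions, as are the function names `attend` and `cross_attention` and the toy feature values.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(queries, keys):
    """For each query vector, return its attention weights over the keys
    and the resulting weighted sum of the keys (the context vector)."""
    dim = len(keys[0])
    scale = math.sqrt(dim)  # scaled dot-product affinity (assumed form)
    weights, contexts = [], []
    for q in queries:
        w = softmax([dot(q, k) / scale for k in keys])
        ctx = [sum(wi * k[j] for wi, k in zip(w, keys)) for j in range(dim)]
        weights.append(w)
        contexts.append(ctx)
    return weights, contexts

def cross_attention(img_feats, txt_feats):
    """Hypothetical two-way cross-attention between modalities."""
    # Each image region attends over the question words.
    img_w, txt_ctx = attend(img_feats, txt_feats)
    # Each question word attends over the image regions.
    txt_w, img_ctx = attend(txt_feats, img_feats)
    return img_w, txt_ctx, txt_w, img_ctx

# Toy inputs: 3 image regions and 2 question words, 4-dimensional features.
img_feats = [[1.0, 0.0, 0.0, 0.0],
             [0.0, 1.0, 0.0, 0.0],
             [0.0, 0.0, 1.0, 0.0]]
txt_feats = [[2.0, 0.0, 0.0, 0.0],
             [0.0, 0.0, 2.0, 0.0]]

img_w, txt_ctx, txt_w, img_ctx = cross_attention(img_feats, txt_feats)
```

In this toy setting, the first region aligns with the first word and the third region with the second, so their attention weights concentrate accordingly, while the second region matches neither word and splits its weight evenly. In a full model, the attended features would then pass through the low-dimensional bottleneck layer trained with the information maximization objective, which is not sketched here.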