Visual Question Answering for Remote Sensing (RSVQA) aims to answer natural-language questions about the content of a remote sensing image. Visual feature extraction is therefore an essential step in a VQA pipeline. By incorporating attention mechanisms into this process, models gain the ability to focus selectively on salient regions of the image, prioritizing the visual information most relevant to a given question. In this work, we propose to embed a segmentation-guided attention mechanism into an RSVQA pipeline. We argue that segmentation plays a crucial role in guiding attention by providing a contextual understanding of the visual information, highlighting specific objects or areas of interest. To evaluate this methodology, we provide a new VQA dataset built on very high-resolution RGB orthophotos annotated with 16 segmentation classes and question/answer pairs. Our study shows promising results for the new methodology, which gains almost 10% in overall accuracy over a classical method on the proposed dataset.
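One way to picture segmentation-guided attention is as a segmentation-derived prior reweighting question-conditioned attention over spatial visual features. The sketch below is purely illustrative, not the paper's architecture: the function name, tensor shapes (H, W feature-map grid, C channels, K segmentation classes), and the choice of max class confidence as the prior are all assumptions.

```python
import numpy as np

def segmentation_guided_attention(features, seg_logits, question_emb):
    """Illustrative sketch (assumed interface, not the paper's exact model).

    features:     (H, W, C) spatial visual features
    seg_logits:   (H, W, K) per-pixel segmentation class scores
    question_emb: (C,) question embedding
    Returns the attention-pooled visual vector and the attention map.
    """
    H, W, C = features.shape
    flat = features.reshape(H * W, C)

    # Softmax over classes -> per-pixel class probabilities
    seg_prob = np.exp(seg_logits - seg_logits.max(axis=-1, keepdims=True))
    seg_prob /= seg_prob.sum(axis=-1, keepdims=True)
    # Segmentation prior: confidence of the most likely class at each pixel
    prior = seg_prob.max(axis=-1).reshape(-1)          # (H*W,)

    # Question-feature similarity gives the base attention logits;
    # the log-prior biases attention toward confidently segmented regions
    logits = flat @ question_emb + np.log(prior + 1e-8)
    attn = np.exp(logits - logits.max())
    attn /= attn.sum()                                  # (H*W,), sums to 1

    pooled = (attn[:, None] * flat).sum(axis=0)         # (C,)
    return pooled, attn
```

The pooled vector would then be fused with the question embedding in the usual VQA classification head; the segmentation prior could equally be injected per class rather than via the max confidence.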