Remote Sensing Image-Text Retrieval (RSITR) is pivotal for knowledge services and data mining in the remote sensing (RS) domain. Accounting for the multi-scale structure of image content and text vocabulary lets models learn richer representations and improves retrieval. Current multi-scale RSITR approaches typically align multi-scale fused image features with text features, but they do not align image-text pairs at each scale separately. This omission limits their ability to learn joint representations suited to retrieval. We introduce a novel Multi-Scale Alignment (MSA) method to overcome this limitation. Our method comprises three key innovations: (1) a Multi-Scale Cross-Modal Alignment Transformer (MSCMAT), which computes cross-attention between single-scale image features and localized text features and integrates global textual context to derive a matching score matrix within a mini-batch; (2) a multi-scale cross-modal semantic alignment loss that enforces semantic alignment across scales; and (3) a cross-scale multi-modal semantic consistency loss that uses the matching matrix from the largest scale to guide alignment at smaller scales. We evaluate our method on multiple datasets, demonstrating its efficacy with various visual backbones and its superiority over existing state-of-the-art methods. The code is available at https://github.com/yr666666/MSA
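As a rough illustration of how the two losses described in the abstract could be combined, the sketch below operates on per-scale matching score matrices of the kind MSCMAT is said to produce within a mini-batch. The abstract does not give the loss formulas, so everything concrete here is an assumption: the alignment loss is modeled as a symmetric InfoNCE-style contrastive loss, the consistency loss as a row-wise KL divergence from the largest-scale matrix to each smaller-scale matrix, and the function names, the `temperature` value, and `consistency_weight` are hypothetical. It is not the authors' implementation; see the linked repository for that.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def alignment_loss(scores, temperature=0.07):
    """ASSUMED form: symmetric InfoNCE over a B x B image-text score matrix.

    Row i holds the scores of image i against every text in the mini-batch;
    matched pairs are assumed to lie on the diagonal.
    """
    scaled = [[s / temperature for s in row] for row in scores]
    n = len(scaled)
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    cols = [list(col) for col in zip(*scaled)]
    t2i = -sum(log_softmax(cols[i])[i] for i in range(n)) / n
    return 0.5 * (i2t + t2i)


def consistency_loss(teacher_scores, student_scores, temperature=0.07):
    """ASSUMED form: mean row-wise KL(teacher || student) between softmaxed rows."""
    total = 0.0
    for t_row, s_row in zip(teacher_scores, student_scores):
        t_log = log_softmax([x / temperature for x in t_row])
        s_log = log_softmax([x / temperature for x in s_row])
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_log, s_log))
    return total / len(teacher_scores)


def msa_objective(scores_by_scale, consistency_weight=1.0):
    """Combine both losses; the LAST matrix is taken as the largest scale.

    `consistency_weight` is a hypothetical balancing factor, not from the paper.
    """
    align = sum(alignment_loss(s) for s in scores_by_scale)
    teacher = scores_by_scale[-1]
    consist = sum(consistency_loss(teacher, s) for s in scores_by_scale[:-1])
    return align + consistency_weight * consist


# Toy mini-batch of 2 image-text pairs at two scales (made-up scores).
small_scale = [[0.6, 0.4], [0.5, 0.5]]
large_scale = [[0.9, 0.1], [0.2, 0.8]]
total_loss = msa_objective([small_scale, large_scale])
```

In this sketch the largest-scale matrix acts as a fixed target for the smaller scales, which mirrors the guidance direction stated in the abstract; whether the target is detached from the gradient, and how the terms are weighted, are details the abstract leaves open.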