Remote sensing cross-modal text-image retrieval (RSCTIR) has gained attention for its utility in information mining. However, two challenges remain: effectively integrating global and local information despite the large variations across remote sensing imagery, and ensuring proper feature pre-alignment before modal fusion; both affect retrieval accuracy and efficiency. To address these issues, we propose CMPAGL, a cross-modal pre-aligned method leveraging global and local information. Our Gswin transformer block combines local window self-attention with global-local window cross-attention to capture multi-scale features. A pre-alignment mechanism simplifies modal fusion training and improves retrieval performance. Additionally, we introduce a similarity matrix reweighting (SMR) algorithm for reranking, and enhance the triplet loss function with an intra-class distance term to optimize feature learning. Experiments on four datasets, including RSICD and RSITMD, validate CMPAGL's effectiveness, achieving improvements of up to 4.65% in R@1 and 2.28% in mean Recall (mR) over state-of-the-art methods.
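The enhanced triplet objective mentioned above can be illustrated with a minimal sketch. This is an assumption about the general form (standard hinge-based triplet loss plus an added intra-class distance penalty with weight `lam`); the paper's exact formulation, margin, and weighting are not given in this abstract.

```python
import numpy as np

def triplet_loss_with_intra(anchor, positive, negative, margin=0.2, lam=0.1):
    """Hinge triplet loss augmented with an intra-class distance term.

    NOTE: this formulation and the hyperparameters `margin` and `lam`
    are illustrative assumptions, not the paper's actual values.
    """
    d_ap = np.linalg.norm(anchor - positive)   # anchor-positive distance
    d_an = np.linalg.norm(anchor - negative)   # anchor-negative distance
    hinge = max(0.0, d_ap - d_an + margin)     # standard triplet term
    intra = d_ap ** 2                          # pulls matched pairs closer
    return hinge + lam * intra
```

The intra-class term keeps penalizing the anchor-positive distance even after the triplet margin is satisfied, tightening matched image-text pairs in the embedding space.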