PIR: Remote Sensing Image-Text Retrieval with Prior Instruction Representation Learning

Remote sensing image-text retrieval is a foundational task in remote sensing interpretation that aligns vision and language representations. This paper introduces a prior instruction representation (PIR) learning paradigm that draws on prior knowledge to instruct the adaptive learning of vision and text representations. Based on PIR, we design PIR-ITR, a domain-adapted remote sensing image-text retrieval framework that addresses semantic noise in vision-language understanding tasks. With massive additional data now used to pre-train vision-language foundation models, remote sensing image-text retrieval has further developed into an open-domain retrieval task. Building on this, we propose PIR-CLIP, a domain-specific CLIP-based framework for remote sensing image-text retrieval, which addresses semantic noise in remote sensing vision-language representations and further improves open-domain retrieval performance. For vision representation, Vision Instruction Representation (VIR), based on Spatial-PAE, uses prior knowledge from remote sensing scene recognition to build a belief matrix that selects key features, reducing the impact of semantic noise. For text representation, Language Cycle Attention (LCA), based on Temporal-PAE, uses the previous time step to cyclically activate the current time step, enhancing text representation capability. A cluster-wise Affiliation Loss (AL) is proposed to constrain inter-class relationships and reduce semantic confusion zones in the common subspace. Comprehensive experiments demonstrate that PIR enhances vision and text representations and outperforms state-of-the-art methods for both closed-domain and open-domain retrieval on two benchmark datasets, RSICD and RSITMD.
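
To make the idea of a cluster-wise loss more concrete, the following is a minimal sketch and not the paper's formulation, which the abstract does not give. It assumes that scene-category labels define the clusters, that each cluster center is the normalized mean of the text embeddings in that cluster, and that the penalty is one minus the cosine similarity between an image embedding and the center of its own cluster. The function names (`l2_normalize`, `cluster_centers`, `affiliation_loss`) and the toy embeddings are hypothetical.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cluster_centers(embeddings, labels):
    """Assumed definition: normalized mean embedding of each cluster."""
    groups = defaultdict(list)
    for emb, lab in zip(embeddings, labels):
        groups[lab].append(emb)
    centers = {}
    for lab, members in groups.items():
        dim = len(members[0])
        mean = [sum(m[i] for m in members) / len(members) for i in range(dim)]
        centers[lab] = l2_normalize(mean)
    return centers


def affiliation_loss(image_embs, text_embs, labels):
    """Illustrative loss: mean (1 - cosine) between each image embedding
    and the text center of the cluster the image belongs to."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    centers = cluster_centers(texts, labels)
    total = 0.0
    for img, lab in zip(images, labels):
        cosine = sum(a * b for a, b in zip(img, centers[lab]))
        total += 1.0 - cosine
    return total / len(images)


# Toy 2-D embeddings; the scene labels are made up for illustration.
image_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
text_embs = [[1.0, 0.1], [0.8, 0.0], [0.1, 1.0]]
labels = ["airport", "airport", "farmland"]

aligned = affiliation_loss(image_embs, text_embs, labels)
# Reversing the captions mismatches them with the scene clusters.
mismatched = affiliation_loss(image_embs, text_embs[::-1], labels)
```

In this toy setting the loss is small when the captions of a scene cluster lie near the images of the same cluster and grows when the captions are mismatched, which is the kind of behavior a cluster-level constraint on the common subspace is meant to encourage.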
