Effectively and efficiently retrieving images from remote sensing databases is a critical challenge in the era of remote sensing big data. Using hand-drawn sketches as retrieval inputs offers intuitive and user-friendly advantages, yet the potential of integrating multi-level sketch features remains underexplored, leading to suboptimal retrieval performance. To address this gap, we introduce a novel zero-shot, sketch-based retrieval method for remote sensing images that combines multi-level feature extraction, self-attention-guided tokenization and filtering, and cross-modality attention updates. The approach relies solely on visual information and requires no semantic knowledge about the sketch or image. It first applies multi-level, self-attention-guided feature extraction to tokenize query sketches, and self-attention feature extraction to tokenize candidate images. It then uses cross-attention to establish token correspondences between the two modalities, from which sketch-to-image similarity is computed. On multiple datasets, our method significantly outperforms existing sketch-based remote sensing image retrieval techniques. Notably, it also exhibits robust zero-shot learning and strong generalization to unseen categories and novel remote sensing data. Scalability can be further improved by pre-computing the retrieval tokens of all candidate images in a database. This work underscores the potential of multi-level, attention-guided tokenization for cross-modal remote sensing image retrieval. For broader accessibility and to facilitate further research, the code and dataset used in this study are publicly available at https://github.com/Snowstormfly/Cross-modal-retrieval-MLAGT.
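The retrieval step described above can be sketched as follows. This is a hedged illustration, not the paper's implementation: real token extraction from a backbone network is replaced with random features, and all function names and dimensions are hypothetical. It shows sketch tokens attending to candidate-image tokens via scaled dot-product cross-attention, after which candidates are ranked by cosine similarity of pooled tokens.

```python
import numpy as np

def cross_attention(query_tokens, key_value_tokens):
    """Scaled dot-product cross-attention: sketch tokens attend to image tokens."""
    d = query_tokens.shape[-1]
    scores = query_tokens @ key_value_tokens.T / np.sqrt(d)      # (Tq, Tk)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)               # row-wise softmax
    return weights @ key_value_tokens                            # updated sketch tokens

def sketch_to_image_similarity(sketch_tokens, image_tokens):
    """Cosine similarity between mean-pooled updated sketch tokens and image tokens."""
    updated = cross_attention(sketch_tokens, image_tokens)
    a = updated.mean(axis=0)
    b = image_tokens.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

rng = np.random.default_rng(0)
sketch = rng.normal(size=(16, 64))                        # 16 sketch tokens, dim 64
gallery = [rng.normal(size=(49, 64)) for _ in range(5)]   # 5 candidate images

# As noted in the abstract, image tokens can be pre-computed offline for the
# whole database, so only the cross-attention update runs at query time.
scores = [sketch_to_image_similarity(sketch, img) for img in gallery]
ranking = np.argsort(scores)[::-1]                        # best match first
```

Pre-computing and caching the gallery tokens is what makes the method scale: the per-query cost reduces to one cross-attention pass and a similarity computation per candidate.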