Zero-shot sketch-based remote sensing image retrieval based on multi-level and attention-guided tokenization

Retrieving images from remote sensing databases effectively and efficiently is a critical challenge in remote sensing big data. Hand-drawn sketches are an intuitive and user-friendly form of query, yet the integration of multi-level features from sketches remains underexplored, which leads to suboptimal retrieval performance. To address this gap, we introduce a novel zero-shot, sketch-based retrieval method for remote sensing images that combines multi-level feature extraction, self-attention-guided tokenization and filtering, and cross-modality attention update. The approach uses only visual information and requires no semantic knowledge about the sketch or the image. It first tokenizes the query sketches with multi-level, self-attention-guided feature extraction and tokenizes the candidate images with self-attention feature extraction. It then uses cross-attention to establish token correspondence between the two modalities, from which the sketch-to-image similarity is computed. In tests on multiple datasets, our method significantly outperforms existing sketch-based remote sensing image retrieval techniques. It also exhibits robust zero-shot learning capabilities and generalizes well to unseen categories and novel remote sensing data. Scalability can be further improved by pre-calculating the retrieval tokens for all candidate images in a database. This research underscores the significant potential of multi-level, attention-guided tokenization in cross-modal remote sensing image retrieval. The code and dataset used in this study are publicly available at https://github.com/Snowstormfly/Cross-modal-retrieval-MLAGT.
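
The retrieval flow described above (attention-guided token filtering, cross-attention between sketch and image tokens, and pre-calculated gallery tokens) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the top-k filtering rule, the softmax temperature, the cosine-based score, and the 2-D example tokens are all assumptions made for illustration, and a real system would obtain tokens and attention scores from a trained vision backbone.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # fall back to 1.0 for an all-zero vector
    return [x / norm for x in v]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def filter_tokens(tokens, attn_scores, keep):
    """Keep the `keep` tokens with the highest attention score (assumed rule)."""
    top = sorted(range(len(tokens)), key=lambda i: attn_scores[i], reverse=True)[:keep]
    return [normalize(tokens[i]) for i in sorted(top)]

def sketch_tokens_multi_level(levels, keep_per_level):
    """Filter each feature level separately, then concatenate the survivors."""
    out = []
    for tokens, scores in levels:
        out.extend(filter_tokens(tokens, scores, keep_per_level))
    return out

def cross_attention_similarity(sketch_tokens, image_tokens, temperature=0.1):
    """Each sketch token attends over the image tokens; the score is the mean
    cosine between a sketch token and its attention-weighted image summary."""
    total = 0.0
    for q in sketch_tokens:
        weights = softmax([dot(q, k) / temperature for k in image_tokens])
        attended = [sum(w * k[d] for w, k in zip(weights, image_tokens))
                    for d in range(len(q))]
        total += dot(q, normalize(attended))
    return total / len(sketch_tokens)

def rank_gallery(sketch_tokens, gallery):
    """Rank gallery entries by similarity to the query, best match first."""
    scores = {name: cross_attention_similarity(sketch_tokens, toks)
              for name, toks in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Gallery tokens are computed once, offline (hypothetical 2-D tokens and scores).
gallery = {
    "airport": filter_tokens([[1.0, 0.0], [0.9, 0.1], [0.2, 0.2]], [0.5, 0.4, 0.1], keep=2),
    "river": filter_tokens([[0.0, 1.0], [0.1, 0.9], [0.3, 0.3]], [0.5, 0.4, 0.1], keep=2),
}

# Query sketch: two feature levels, each a (tokens, attention scores) pair.
sketch_levels = [
    ([[1.0, 0.05], [0.0, 0.0]], [0.9, 0.1]),  # shallow level
    ([[0.8, 0.2], [0.1, 0.1]], [0.7, 0.3]),   # deep level
]
query = sketch_tokens_multi_level(sketch_levels, keep_per_level=1)
ranking = rank_gallery(query, gallery)
print(ranking)  # ['airport', 'river']
```

Because the gallery tokens are computed once and stored, only the sketch tokenization and the cross-attention scoring run at query time, which is the scalability point made in the abstract.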
