Zero-shot sketch-based image retrieval (ZS-SBIR) is challenging due to the cross-domain nature of sketches and photos, as well as the semantic gap between seen and unseen image distributions. Previous methods fine-tune pre-trained models with various side information and learning strategies to learn a compact feature space that is shared between the sketch and photo domains and bridges seen and unseen classes. However, these efforts are inadequate in adapting domains and transferring knowledge from seen to unseen classes. In this paper, we present an effective "Adapt and Align" approach to address the key challenges. Specifically, we insert simple and lightweight domain adapters to learn new abstract concepts of the sketch domain and improve cross-domain representation capabilities. Inspired by recent advances in image-text foundation models (e.g., CLIP) on zero-shot scenarios, we explicitly align the learned image embedding with a more semantic text embedding to achieve the desired knowledge transfer from seen to unseen classes. Extensive experiments on three benchmark datasets and two popular backbones demonstrate the superiority of our method in terms of retrieval accuracy and flexibility.
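The two ideas named in the abstract, a lightweight domain adapter and an explicit image-to-text alignment, can be illustrated with a minimal sketch. The abstract does not specify the adapter architecture or the alignment objective, so everything below is an assumption made for illustration: the adapter is taken to be a residual bottleneck (down-projection, ReLU, up-projection), and the alignment is taken to be a CLIP-style softmax cross-entropy over cosine similarities between an image embedding and one text embedding per class. The names `adapter`, `alignment_loss`, `w_down`, `w_up`, and `temperature` are hypothetical and do not come from the paper.

```python
import math


def adapter(x, w_down, w_up):
    """Assumed residual bottleneck adapter: x + W_up @ relu(W_down @ x)."""
    # Down-project to a small hidden size and apply ReLU.
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_down]
    # Up-project back to the input size.
    delta = [sum(w * h for w, h in zip(row, hidden)) for row in w_up]
    # Residual connection keeps the pre-trained feature when delta is zero.
    return [xi + di for xi, di in zip(x, delta)]


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]


def alignment_loss(image_emb, text_embs, label, temperature=0.07):
    """Assumed alignment objective: cross-entropy over scaled cosine similarities.

    image_emb: one image (sketch or photo) embedding
    text_embs: one text embedding per class
    label:     index of the image's class in text_embs
    """
    img = l2_normalize(image_emb)
    logits = [
        sum(a * b for a, b in zip(img, l2_normalize(t))) / temperature
        for t in text_embs
    ]
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]


if __name__ == "__main__":
    # Toy 3-d feature, 2-d bottleneck, and two class text embeddings.
    feature = [1.0, 2.0, 3.0]
    w_down = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.2]]
    w_up = [[0.05, 0.0], [0.0, 0.05], [0.01, 0.01]]
    adapted = adapter(feature, w_down, w_up)
    class_text_embs = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    print(adapted, alignment_loss(adapted, class_text_embs, label=1))
```

In this sketch only the small `w_down` and `w_up` matrices would be trained, which is one common reading of "lightweight". Because the class side is a text embedding rather than a learned classifier weight, an unseen class can in principle be handled by embedding its name, which matches the knowledge-transfer motivation stated in the abstract.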