Zero-shot sketch-based image retrieval (ZS-SBIR) is challenging because of the cross-domain nature of sketches and photos and the semantic gap between seen and unseen image distributions. Previous methods fine-tune pre-trained models with various side information and learning strategies to learn a compact feature space that is shared between the sketch and photo domains and that bridges seen and unseen classes. However, these efforts fall short both in adapting across domains and in transferring knowledge from seen to unseen classes. In this paper, we present an effective "Adapt and Align" approach to address these two challenges. Specifically, we insert simple and lightweight domain adapters to learn new abstract concepts of the sketch domain and improve cross-domain representation capabilities. Inspired by recent advances of image-text foundation models (e.g., CLIP) in zero-shot scenarios, we explicitly align the learned image embedding with a semantically richer text embedding to achieve the desired knowledge transfer from seen to unseen classes. Extensive experiments on three benchmark datasets and two popular backbones demonstrate the superiority of our method in terms of retrieval accuracy and flexibility.
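As a rough illustration of the two ingredients named above, the following minimal sketch shows one common way to build a lightweight adapter (a residual bottleneck) and one common way to align an image embedding with class text embeddings (a CLIP-style cross-entropy over cosine similarities). The abstract does not specify either design, so the function names, the bottleneck form, and the `temperature` value are all assumptions made for illustration rather than the method's actual implementation.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def adapter(x, W_down, W_up):
    """Hypothetical bottleneck adapter: x + W_up * relu(W_down * x).

    The residual connection means a zero-initialised W_up leaves the
    pre-trained feature x unchanged at the start of training.
    """
    hidden = [max(0.0, v) for v in matvec(W_down, x)]
    return [xi + ui for xi, ui in zip(x, matvec(W_up, hidden))]


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]


def alignment_loss(image_emb, text_embs, label, temperature=0.07):
    """Assumed alignment objective: cross-entropy over cosine similarities
    between one image (or sketch) embedding and every class text embedding."""
    img = l2_normalize(image_emb)
    logits = [
        sum(a * b for a, b in zip(img, l2_normalize(t))) / temperature
        for t in text_embs
    ]
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]


if __name__ == "__main__":
    feature = [1.0, -2.0, 0.5]
    W_down = [[0.1, 0.2, 0.3]]        # 3 -> 1 bottleneck
    W_up = [[0.0], [0.0], [0.0]]      # zero-initialised, 1 -> 3
    print(adapter(feature, W_down, W_up))

    class_text_embs = [[1.0, 0.0], [0.0, 1.0]]
    print(alignment_loss([0.9, 0.1], class_text_embs, label=0))
```

In a setup like this, only the adapter weights would typically be trained while the backbone stays frozen, and at test time text embeddings of unseen class names could be computed without any additional training images. Both points are general properties of this kind of design, not claims about the paper's specific configuration.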