Sketch-based image retrieval (SBIR) associates hand-drawn sketches with their corresponding realistic images. In this study, we tackle two major challenges of this task simultaneously: i) zero-shot retrieval, which deals with unseen categories, and ii) fine-grained retrieval, which requires intra-category, instance-level matching. Our key insight is that addressing this cross-category, fine-grained recognition task solely from the generalization perspective may be inadequate, since the knowledge accumulated from a limited set of seen categories might not be fully valuable or transferable to unseen target categories. Motivated by this, we propose a dual-modal prompting CLIP (DP-CLIP) network built around an adaptive prompting strategy. Specifically, to facilitate the adaptation of DP-CLIP to unpredictable target categories, we use a set of images from the target category and the textual category label to construct, respectively, a set of category-adaptive prompt tokens and channel scales. By integrating this generated guidance, DP-CLIP gains category-centric insights, adapting efficiently to novel categories and capturing the discriminative clues needed for effective retrieval within each target category. With these designs, DP-CLIP outperforms the state-of-the-art fine-grained zero-shot SBIR method by 7.3% in Acc.@1 on the Sketchy dataset. On two other category-level zero-shot SBIR benchmarks, our method also achieves promising performance.
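To make two of the ideas above concrete, the following is a minimal sketch, not the DP-CLIP implementation. It assumes that channel scales are obtained by passing a category-label embedding through a linear map and a sigmoid, and that Acc.@1 is the fraction of sketches whose top-ranked gallery image (by cosine similarity, within one category) is the paired image. The function names `channel_scales`, `apply_scales`, and `acc_at_1`, the weight matrix, and the toy features are all hypothetical. The abstract does not specify any of these details.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def channel_scales(text_embedding: Vector, weights: Sequence[Vector]) -> List[float]:
    """Hypothetical mapping from a label embedding to per-channel scales.

    Each row of `weights` yields one scale in (0, 1) via a sigmoid.
    This is an illustrative assumption, not the mapping used by DP-CLIP.
    """
    scales = []
    for row in weights:
        logit = sum(w * t for w, t in zip(row, text_embedding))
        scales.append(1.0 / (1.0 + math.exp(-logit)))
    return scales


def apply_scales(feature: Vector, scales: Vector) -> List[float]:
    """Reweight each feature channel by its category-specific scale."""
    return [f * s for f, s in zip(feature, scales)]


def acc_at_1(sketch_feats: Sequence[Vector], image_feats: Sequence[Vector]) -> float:
    """Instance-level Acc.@1 within a single category.

    Assumes sketch i is paired with image i and that the gallery
    contains only images of the same category.
    """
    gallery = [l2_normalize(g) for g in image_feats]
    hits = 0
    for i, sketch in enumerate(sketch_feats):
        query = l2_normalize(sketch)
        sims = [sum(a * b for a, b in zip(query, g)) for g in gallery]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += int(best == i)
    return hits / len(sketch_feats)


if __name__ == "__main__":
    # Toy 2-D features for one unseen category (made-up numbers).
    label_embedding = [0.5, -0.5]
    weights = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical projection
    scales = channel_scales(label_embedding, weights)

    sketches = [apply_scales(f, scales) for f in ([1.0, 0.0], [0.0, 1.0])]
    images = [apply_scales(f, scales) for f in ([0.9, 0.1], [0.2, 0.8])]
    print("Acc.@1:", acc_at_1(sketches, images))
```

In this sketch the scales depend only on the category label, so the sketch and image branches are reweighted identically for a given target category. Whether DP-CLIP shares scales across modalities in this way is not stated in the abstract.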