CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not

In this paper, we leverage CLIP for zero-shot sketch-based image retrieval (ZS-SBIR). We are largely inspired by recent advances in foundation models and the unparalleled generalisation ability they seem to offer, and for the first time tailor them to benefit the sketch community. We put forward novel designs on how best to achieve this synergy, for both the category-level setting and the fine-grained setting ("all"). At the core of our solution is a prompt learning setup. First, we show that simply by factoring in sketch-specific prompts, we already obtain a category-level ZS-SBIR system that outperforms all prior art by a large margin (24.8%), a strong testimony to the value of studying the CLIP and ZS-SBIR synergy. Moving on to the fine-grained setup is, however, trickier and requires a deeper dive into this synergy. For that, we propose two specific designs to tackle the fine-grained matching nature of the problem: (i) an additional regularisation loss to ensure that the relative separation between sketches and photos is uniform across categories, which is not the case for the gold-standard standalone triplet loss, and (ii) a clever patch shuffling technique to help establish instance-level structural correspondences between sketch-photo pairs. With these designs, we again observe significant performance gains of around 26.9% over the previous state of the art. The take-home message, if any, is that the proposed CLIP and prompt learning paradigm carries great promise for tackling other sketch-related tasks (not limited to ZS-SBIR) where data scarcity remains a great challenge. Project page: https://aneeshan95.github.io/Sketch_LVM/
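To make the two fine-grained ingredients more concrete, the following is a minimal sketch in NumPy of a standard triplet loss and of a patch shuffling step in which one permutation is applied to both a sketch and its paired photo. It is an illustration under assumptions, not the authors' implementation: the function names `shuffle_patches` and `triplet_loss`, the grid size, the margin value, and the choice of squared Euclidean distance are all hypothetical, and the abstract specifies neither how shuffled patches enter the training objective nor the form of the regularisation loss, which is therefore not shown.

```python
import numpy as np


def shuffle_patches(image, grid, perm):
    """Split an (H, W, C) image into grid x grid patches and reorder them by perm.

    Hypothetical helper: perm is a sequence of grid*grid patch indices in
    row-major order. Pixels beyond a multiple of the patch size are cropped.
    """
    h, w, c = image.shape
    ph, pw = h // grid, w // grid
    cropped = image[: grid * ph, : grid * pw]
    # (grid, ph, grid, pw, C) -> (grid, grid, ph, pw, C) -> (grid*grid, ph, pw, C)
    patches = cropped.reshape(grid, ph, grid, pw, c).swapaxes(1, 2)
    patches = patches.reshape(grid * grid, ph, pw, c)
    patches = patches[perm]
    # Reverse the reshaping to reassemble the shuffled patches into an image.
    out = patches.reshape(grid, grid, ph, pw, c).swapaxes(1, 2)
    return out.reshape(grid * ph, grid * pw, c)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss on feature vectors using squared Euclidean distance."""
    d_pos = float(np.sum((anchor - positive) ** 2))
    d_neg = float(np.sum((anchor - negative) ** 2))
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sketch = rng.random((224, 224, 3))
    photo = rng.random((224, 224, 3))
    # Assumption: the same permutation is applied to both members of a pair,
    # so corresponding regions stay aligned after shuffling.
    perm = rng.permutation(4)
    shuffled_sketch = shuffle_patches(sketch, grid=2, perm=perm)
    shuffled_photo = shuffle_patches(photo, grid=2, perm=perm)
    print(shuffled_sketch.shape, shuffled_photo.shape)
```

In a real pipeline the shuffled images would be passed through the image encoder, and the resulting features, rather than raw pixels, would be compared by a loss such as the triplet loss above.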
