Zero-shot learning lets a machine learning model handle unseen categories efficiently, without exhaustive data collection. Zero-shot Sketch-based Image Retrieval (ZS-SBIR) simulates real-world scenarios in which paired sketch-photo samples are difficult and costly to collect. We propose a novel framework that aligns sketches and photos indirectly by contrasting each of them with text, removing the need for access to sketch-photo pairs. With an explicit modality encoding learned from data, our approach disentangles modality-agnostic semantics from modality-specific information, bridging the modality gap and enabling effective cross-modal content retrieval within a joint latent space. Comprehensive experiments verify the efficacy of the proposed model on ZS-SBIR and show that it can also be applied to generalized and fine-grained settings.
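To make the idea of text-anchored alignment concrete, the following is a minimal NumPy sketch, not the implementation behind the reported results. It assumes that each sketch and each photo comes with a text embedding of its class, that each batch contains distinct classes, and that the modality encoding can be modeled as one vector per modality subtracted from the encoder output. The abstract does not specify any of these details, and the names `strip_modality`, `text_anchored_loss`, and `retrieve`, as well as the temperature value, are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(visual, text, temperature=0.07):
    """Symmetric InfoNCE loss; row i of `visual` is paired with row i of `text`."""
    v, t = l2_normalize(visual), l2_normalize(text)
    logits = v @ t.T / temperature
    idx = np.arange(len(v))
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2t + loss_t2v)


def strip_modality(features, modality_code):
    """Assumption: modality-specific information is an additive per-modality vector."""
    return features - modality_code


def text_anchored_loss(sketch_feats, sketch_text, photo_feats, photo_text,
                       sketch_code, photo_code, temperature=0.07):
    """Contrast sketches with text and photos with text; never sketches with photos."""
    s = strip_modality(sketch_feats, sketch_code)
    p = strip_modality(photo_feats, photo_code)
    return (info_nce(s, sketch_text, temperature)
            + info_nce(p, photo_text, temperature))


def retrieve(sketch_query, photo_gallery, sketch_code, photo_code):
    """Rank gallery photos by cosine similarity to one sketch in the joint space."""
    q = l2_normalize(strip_modality(sketch_query, sketch_code))
    g = l2_normalize(strip_modality(photo_gallery, photo_code))
    return np.argsort(-(g @ q))
```

Because both loss terms pull visual features toward the same text embeddings, sketches and photos of one class end up close together even though no sketch-photo pair is ever contrasted directly. In a real system the features would come from trained encoders and the modality codes would be learned parameters; here they are plain arrays to keep the sketch self-contained.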