GOPro: Generate and Optimize Prompts in CLIP using Self-Supervised Learning

Large-scale foundation models such as CLIP have demonstrated remarkable success in visual recognition tasks by embedding images in a semantically rich space. Self-supervised learning (SSL) has also shown promise in improving visual recognition by learning invariant features. However, combining CLIP with SSL is challenging: a multi-task framework that blends CLIP's contrastive loss with an SSL loss suffers from difficult loss weighting and from inconsistency among different views of an image in CLIP's output space. To overcome these challenges, we propose GOPro, a prompt learning-based model. GOPro is a unified framework that uses a pair of learnable image and text projectors atop CLIP to ensure similarity between the augmented views of an input image in a shared image-text embedding space, promoting invariance and generalizability. To learn such prompts automatically, we leverage the visual content and style primitives extracted from pre-trained CLIP and adapt them to the target task. In addition to CLIP's cross-domain contrastive loss, we introduce a visual contrastive loss and a novel prompt consistency loss, both of which consider the different views of the images. GOPro is trained end-to-end on all three loss objectives, combining the strengths of CLIP and SSL in a principled manner. Empirical evaluations demonstrate that GOPro outperforms state-of-the-art prompting techniques by a significant margin on three challenging domain generalization tasks across multiple benchmarks. Our code is available at https://github.com/mainaksingha01/GOPro.
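
To make the three training objectives named above concrete, the following is a minimal sketch of how they might be combined, not the authors' implementation (for that, see the linked repository). Several things here are assumptions: all function names are hypothetical, plain Python lists stand in for embedding tensors, the prompt consistency term is taken to be one minus cosine similarity between the prompt embeddings of two views, the temperature value is illustrative, and the three terms are summed with equal weight.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def clip_loss(image_embs, text_embs, labels, tau=0.07):
    """Image-to-text loss: each image should match its class prompt embedding."""
    return mean(
        cross_entropy([cosine(img, txt) / tau for txt in text_embs], y)
        for img, y in zip(image_embs, labels)
    )


def visual_contrastive_loss(view1, view2, tau=0.07):
    """InfoNCE-style loss: view1[i] is positive with view2[i]; the other
    entries of view2 in the batch act as negatives."""
    return mean(
        cross_entropy([cosine(a, b) / tau for b in view2], i)
        for i, a in enumerate(view1)
    )


def prompt_consistency_loss(prompts1, prompts2):
    """ASSUMPTION: consistency is measured as 1 - cosine similarity between
    the prompt embeddings generated from two views of the same image."""
    return mean(1.0 - cosine(p, q) for p, q in zip(prompts1, prompts2))


def total_loss(img_v1, img_v2, text_embs, labels, prompts_v1, prompts_v2, tau=0.07):
    """ASSUMPTION: equal-weight sum of the three objectives."""
    l_clip = 0.5 * (
        clip_loss(img_v1, text_embs, labels, tau)
        + clip_loss(img_v2, text_embs, labels, tau)
    )
    l_vis = visual_contrastive_loss(img_v1, img_v2, tau)
    l_prompt = prompt_consistency_loss(prompts_v1, prompts_v2)
    return l_clip + l_vis + l_prompt


if __name__ == "__main__":
    # Toy batch: two images, two classes, 2-D embeddings.
    view1 = [[1.0, 0.1], [0.1, 1.0]]
    view2 = [[0.9, 0.2], [0.2, 0.9]]
    texts = [[1.0, 0.0], [0.0, 1.0]]
    print(total_loss(view1, view2, texts, [0, 1], view1, view2))
```

In a real training loop the embeddings would come from CLIP's encoders followed by the learnable projectors, and the loss would be computed on tensors with automatic differentiation; the list-based version above only illustrates the structure of the objective.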
