The popularity of Contrastive Language-Image Pre-training (CLIP) has propelled its application to diverse downstream vision tasks. To improve its performance on these tasks, few-shot learning has become a widely adopted technique. However, existing methods either exhibit limited performance or require an excessive number of learnable parameters. In this paper, we propose APE, an Adaptive Prior rEfinement method for CLIP's pre-trained knowledge, which achieves superior accuracy with high computational efficiency. Via a prior refinement module, we analyze the inter-class disparity in the downstream data and decouple the domain-specific knowledge from the CLIP-extracted cache model. On top of that, we introduce two model variants: a training-free APE and a training-required APE-T. We explore the trilateral affinities among the test image, the prior cache model, and the textual representations, and train only a lightweight category-residual module. In average accuracy over 11 benchmarks, both APE and APE-T attain state-of-the-art results, outperforming the second-best by +1.59% and +1.99%, respectively, under 16 shots with 30× fewer learnable parameters.
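To make the training-free idea concrete, the following is a minimal NumPy sketch of a cache-based few-shot classifier with channel selection. It is an illustration under stated assumptions, not the paper's implementation: the variance-based channel criterion, the exponential affinity, the cache weighting by text-classifier confidence, the function names (`select_channels`, `predict`), and the hyperparameters `alpha` and `beta` are all hypothetical choices, and random vectors stand in for CLIP features.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def select_channels(class_means, k):
    """Keep the k channels whose class means differ most across classes.

    Assumption: variance of the per-class means is used here as a simple
    proxy for inter-class disparity.
    """
    variance = class_means.var(axis=0)
    return np.sort(np.argsort(variance)[-k:])


def predict(test_feat, text_feats, cache_keys, cache_labels, channels,
            alpha=1.0, beta=5.5):
    """Combine three affinities into class logits (hypothetical formulation)."""
    num_classes = text_feats.shape[0]
    # 1) test image vs. text: zero-shot logits (100 is a common CLIP logit scale).
    zero_shot = 100.0 * test_feat @ text_feats.T
    # 2) test image vs. cache, computed on the selected channels only.
    query = l2_normalize(test_feat[channels])
    keys = l2_normalize(cache_keys[:, channels])
    affinity = np.exp(-beta * (1.0 - keys @ query))
    # 3) cache vs. text: weight each cached sample by the probability the
    #    text classifier assigns to its true label.
    probs = softmax(100.0 * cache_keys @ text_feats.T, axis=1)
    weights = probs[np.arange(len(cache_labels)), cache_labels]
    one_hot = np.eye(num_classes)[cache_labels]
    cache_logits = (affinity * weights) @ one_hot
    return zero_shot + alpha * cache_logits


# Toy data: random unit vectors stand in for CLIP text and image features.
rng = np.random.default_rng(0)
num_classes, shots, dim = 3, 4, 16
text_feats = l2_normalize(rng.normal(size=(num_classes, dim)))
cache_labels = np.repeat(np.arange(num_classes), shots)
noise = 0.1 * rng.normal(size=(num_classes * shots, dim))
cache_keys = l2_normalize(text_feats[cache_labels] + noise)

class_means = np.stack(
    [cache_keys[cache_labels == c].mean(axis=0) for c in range(num_classes)]
)
channels = select_channels(class_means, k=8)
test_feat = l2_normalize(text_feats[1] + 0.1 * rng.normal(size=dim))
logits = predict(test_feat, text_feats, cache_keys, cache_labels, channels)
print("predicted class:", int(np.argmax(logits)))
```

Nothing in this sketch is trained. A training-required variant in the spirit of APE-T would add a small set of learnable per-category residuals on top of these frozen features, which this sketch does not attempt.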