With the emergence of pre-trained vision-language models such as CLIP, adapting them to downstream classification tasks has attracted significant attention in recent research. Adaptation strategies typically fall into three paradigms: zero-shot adaptation, few-shot adaptation, and the recently proposed training-free few-shot adaptation. Most existing approaches are tailored to a specific setting and cater to only one or two of these paradigms. In this paper, we introduce a versatile adaptation approach that works effectively under all three settings. Specifically, we propose dual memory networks comprising dynamic and static memory components. The static memory caches training data knowledge, enabling training-free few-shot adaptation, while the dynamic memory preserves historical test features online during testing, allowing the exploration of additional data insights beyond the training set. This capability enhances model performance in the few-shot setting and makes the model usable in the absence of training data. The two memory networks employ the same flexible memory interactive strategy, which can operate in a training-free mode and can be further enhanced by incorporating learnable projection layers. We evaluate our approach on 11 datasets under the three task settings. Remarkably, in the zero-shot scenario, it outperforms existing methods by over 3\% and even shows superior results against methods utilizing external training data. Additionally, our method exhibits robust performance against natural distribution shifts. Code is available at \url{https://github.com/YBZh/DMN}.
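To make the roles of the two memories concrete, the following is a minimal NumPy sketch of the idea described in the abstract. It is not the released implementation: the class `Memory`, the function `classify`, the ring-buffer write rule, the exponential similarity weighting with sharpness `beta`, the pseudo-label write into the dynamic memory, and the mixing weights `a_static` and `a_dynamic` are all illustrative assumptions. Image and text features are assumed to come from a frozen CLIP-like encoder that is not shown here.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


class Memory:
    """Hypothetical per-class feature cache with a fixed number of slots."""

    def __init__(self, num_classes, dim, slots):
        self.feats = np.zeros((num_classes, slots, dim))
        self.count = np.zeros(num_classes, dtype=int)
        self.slots = slots

    def write(self, feat, label):
        # Assumption: a ring buffer that overwrites the oldest slot when full.
        idx = self.count[label] % self.slots
        self.feats[label, idx] = l2_normalize(feat)
        self.count[label] += 1

    def read(self, query, beta=5.0):
        """Training-free readout: similarity-weighted class prototypes."""
        q = l2_normalize(query)
        sims = self.feats @ q  # shape (classes, slots)
        used = np.minimum(self.count, self.slots)[:, None]
        filled = np.arange(self.slots)[None, :] < used  # mask empty slots
        weights = np.exp(beta * sims) * filled
        protos = (weights[..., None] * self.feats).sum(axis=1)
        # Classes with no cached feature keep a zero prototype (logit 0).
        return l2_normalize(protos) @ q  # shape (classes,)


def classify(query, text_feats, dynamic_mem, static_mem=None,
             a_static=1.0, a_dynamic=1.0):
    """Combine text logits with memory readouts, then update dynamic memory."""
    q = l2_normalize(query)
    logits = l2_normalize(text_feats) @ q  # zero-shot text classifier
    if static_mem is not None:  # few-shot settings only
        logits = logits + a_static * static_mem.read(q)
    logits = logits + a_dynamic * dynamic_mem.read(q)
    pred = int(np.argmax(logits))
    # Assumption: the test feature is stored under its predicted label.
    dynamic_mem.write(q, pred)
    return pred, logits
```

In this sketch, the zero-shot setting corresponds to calling `classify` with `static_mem=None`, so only the text classifier and the online dynamic memory contribute. The training-free few-shot setting first fills a `Memory` with labeled training features via `write` and passes it as `static_mem`. The learnable projection layers mentioned in the abstract are omitted.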