Learning generalized representations from limited training samples is crucial for applying deep neural networks in low-resource scenarios. Recently, methods based on Contrastive Language-Image Pre-training (CLIP) have shown promising performance on few-shot adaptation tasks. To avoid the catastrophic forgetting and overfitting caused by few-shot fine-tuning, existing works usually freeze the parameters of CLIP pre-trained on large-scale datasets, overlooking the possibility that some parameters might not be suitable for downstream tasks. We therefore revisit CLIP's visual encoder, focusing on its distinctive attention pooling layer, which computes a spatial weighted sum of the dense feature maps. Because dense feature maps contain meaningful semantic information, and different semantics matter to different degrees across downstream tasks (for example, ears and eyes matter more than side mirrors in pet classification), applying the same weighted-sum operation to dense features in every few-shot task might not be appropriate. Hence, we propose fine-tuning the parameters of the attention pooling layer during training to encourage the model to focus on task-specific semantics. At inference, we perform residual blending between the features pooled by the fine-tuned and the original attention pooling layers to incorporate both the few-shot knowledge and the prior knowledge of the pre-trained CLIP. We term this method Semantic-Aware FinE-tuning (SAFE). SAFE is effective in enhancing the conventional few-shot CLIP and is compatible with the existing adapter approach (termed SAFE-A).
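The pooling and blending steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the single-head `attention_pool` is a simplified stand-in for CLIP's attention pooling layer (no positional embeddings, multiple heads, or output projection), the "fine-tuned" weights are merely a random perturbation of the frozen ones, and both the convex form of `residual_blend` and its `alpha` parameter are assumptions, since the abstract does not give the blending formula.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(vec, mat):
    """Multiply a row vector of length D by a D x D matrix."""
    return [sum(v * mat[i][j] for i, v in enumerate(vec))
            for j in range(len(mat[0]))]


def attention_pool(dense, w_q, w_k, w_v):
    """Pool N spatial features (each of length D) into one D-vector.

    The mean spatial feature serves as the query, so the output is a
    spatial weighted sum of the projected dense features.
    """
    n, d = len(dense), len(dense[0])
    mean = [sum(tok[j] for tok in dense) / n for j in range(d)]
    query = matvec(mean, w_q)
    keys = [matvec(tok, w_k) for tok in dense]
    values = [matvec(tok, w_v) for tok in dense]
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)  # one weight per spatial location
    pooled = [sum(w * val[j] for w, val in zip(weights, values))
              for j in range(d)]
    return pooled, weights


def residual_blend(tuned, original, alpha=0.5):
    """Assumed blending rule: alpha * tuned + (1 - alpha) * original."""
    return [alpha * t + (1.0 - alpha) * o for t, o in zip(tuned, original)]


def random_matrix(rows, cols, scale=1.0):
    return [[random.gauss(0.0, scale) for _ in range(cols)]
            for _ in range(rows)]


random.seed(0)
N_TOKENS, DIM = 49, 8  # hypothetical 7x7 feature map with 8 channels

dense = random_matrix(N_TOKENS, DIM)
# Frozen (pre-trained) projections; random placeholders here.
frozen = [random_matrix(DIM, DIM, DIM ** -0.5) for _ in range(3)]
# Stand-in for the fine-tuned copy: a small perturbation, not real training.
tuned = [[[w + random.gauss(0.0, 0.01) for w in row] for row in mat]
         for mat in frozen]

f_orig, a_orig = attention_pool(dense, *frozen)
f_tuned, a_tuned = attention_pool(dense, *tuned)
f_blend = residual_blend(f_tuned, f_orig, alpha=0.5)
```

Under this assumed rule, `alpha = 0` recovers the feature from the original pooling layer and `alpha = 1` uses only the fine-tuned one, so `alpha` trades off the pre-trained prior against the few-shot knowledge.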