Learning generalized representations from limited training samples is crucial for applying deep neural networks in low-resource scenarios. Recently, methods based on Contrastive Language-Image Pre-training (CLIP) have shown promising performance in few-shot adaptation tasks. To avoid the catastrophic forgetting and overfitting caused by few-shot fine-tuning, existing works usually freeze the parameters of CLIP pre-trained on large-scale datasets, overlooking the possibility that some parameters might not be suitable for downstream tasks. To this end, we revisit CLIP's visual encoder with a specific focus on its distinctive attention pooling layer, which performs a spatial weighted sum of the dense feature maps. Dense feature maps contain meaningful semantic information, and different semantics vary in importance across downstream tasks (for example, pet classification should prioritize semantics such as ears and eyes rather than side mirrors), so applying the same weighted-sum operation to dense features across different few-shot tasks might not be appropriate. Hence, we propose fine-tuning the parameters of the attention pooling layer during training to encourage the model to focus on task-specific semantics. At inference, we perform residual blending between the features pooled by the fine-tuned and the original attention pooling layers to incorporate both the few-shot knowledge and the prior knowledge of pre-trained CLIP. We term this method Semantic-Aware FinE-tuning (SAFE). SAFE is effective in enhancing the conventional few-shot CLIP and is compatible with the existing adapter approach (termed SAFE-A).
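To make the pooling and blending steps concrete, the following is a minimal sketch in plain Python rather than the authors' implementation. It assumes a simplified single-head attention pooling in which the mean token queries the spatial tokens; the actual CLIP layer additionally uses multiple heads, positional embeddings, and an output projection. The names `attention_pool`, `blend`, and `alpha`, the convex-combination form of the blending, and the random perturbation standing in for few-shot fine-tuning are all illustrative assumptions, since the abstract does not give these details.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(tokens, params):
    """Simplified single-head attention pooling (assumed form).

    tokens: list of N spatial feature vectors, each of dimension D.
    params: dict holding D x D matrices "Wq", "Wk", "Wv".
    Returns the pooled feature and the spatial attention weights.
    """
    dim = len(tokens[0])
    # The mean token acts as the query over all spatial positions.
    mean_tok = [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]
    q = matvec(params["Wq"], mean_tok)
    keys = [matvec(params["Wk"], t) for t in tokens]
    values = [matvec(params["Wv"], t) for t in tokens]
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
    weights = softmax(scores)
    # Spatial weighted sum of the value vectors.
    pooled = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return pooled, weights


def blend(pooled_tuned, pooled_orig, alpha):
    """Blend fine-tuned and original pooled features (assumed convex form)."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(pooled_tuned, pooled_orig)]


random.seed(0)
D, N = 4, 6  # toy feature dimension and number of spatial tokens


def rand_matrix():
    return [[random.gauss(0, 0.5) for _ in range(D)] for _ in range(D)]


# Frozen pre-trained pooling parameters (random stand-ins here).
original = {name: rand_matrix() for name in ("Wq", "Wk", "Wv")}
# Stand-in for few-shot fine-tuning: a perturbed copy of the original parameters.
tuned = {
    name: [[w + random.gauss(0, 0.1) for w in row] for row in mat]
    for name, mat in original.items()
}

tokens = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
f_orig, w_orig = attention_pool(tokens, original)
f_tuned, w_tuned = attention_pool(tokens, tuned)
f_final = blend(f_tuned, f_orig, alpha=0.5)
```

In this sketch only the pooling parameters differ between the two branches, so the dense tokens are computed once and pooled twice. Setting `alpha` to 0 recovers the pre-trained feature and setting it to 1 uses the fine-tuned feature alone.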