Recent advances in contrastive language-image pretraining (CLIP) have demonstrated strong zero-shot classification capabilities by aligning visual representations with target text embeddings at the image level. However, in dense prediction tasks, CLIP often struggles to localize visual features within an image and fails to give accurate pixel-level predictions, which prevents it from functioning as a generalized visual foundation model. In this work, we aim to enhance CLIP's potential for semantic segmentation with minimal modifications to its pretrained models. By rethinking self-attention, we find, surprisingly, that CLIP can adapt to dense prediction tasks simply by introducing a novel Correlative Self-Attention (CSA) mechanism. Specifically, we replace the traditional self-attention block in the last layer of CLIP's vision encoder with our CSA module and reuse its pretrained query, key, and value projection matrices, yielding a training-free adaptation approach for zero-shot semantic segmentation with CLIP. Extensive experiments show the advantage of CSA: we obtain a 38.2% average zero-shot mIoU across the eight semantic segmentation benchmarks highlighted in this paper, significantly outperforming the existing SoTA's 33.9% and vanilla CLIP's 14.1%.
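To make the last-layer replacement concrete, the following is a minimal single-head sketch rather than the reference implementation. The abstract does not state the CSA formula, so the form used here is an assumption: attention scores are taken from the self-correlation of the query projections and of the key projections (each softmax-normalized and then summed) instead of from query-key products, while the value projection is reused unchanged. The function names, the toy dimensions, and the random matrices standing in for the pretrained weights are all hypothetical; multi-head splitting, biases, and the output projection are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def vanilla_self_attention(x, w_q, w_k, w_v):
    # Standard scaled dot-product attention: scores come from q @ k.T.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scale = 1.0 / np.sqrt(q.shape[-1])
    attn = softmax(q @ k.T * scale)
    return attn @ v, attn


def correlative_self_attention(x, w_q, w_k, w_v):
    # ASSUMED form of CSA: scores come from q @ q.T and k @ k.T, so each
    # token is compared with other tokens in the same projected space.
    # The projection matrices are reused as-is; nothing is trained.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scale = 1.0 / np.sqrt(q.shape[-1])
    attn = softmax(q @ q.T * scale) + softmax(k @ k.T * scale)
    return attn @ v, attn


# Toy usage: 6 patch tokens with 16-dimensional features.
# Random matrices stand in for the pretrained projections (hypothetical).
rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 16))
w_q, w_k, w_v = (rng.standard_normal((16, 16)) for _ in range(3))

dense_features, csa_attn = correlative_self_attention(tokens, w_q, w_k, w_v)
print(dense_features.shape, csa_attn.shape)
```

Under this assumed form, each row of the attention matrix sums to 2 rather than 1, because two softmax-normalized maps are added. In a segmentation pipeline, the resulting per-patch features would then be compared against the text embeddings of the class names to produce a label for each patch.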