CLIP achieves impressive zero-shot performance after pre-training on a large-scale dataset of paired image-text data. Previous works have exploited CLIP by adding manually designed visual prompts, such as colored circles and blur masks, to images to guide the model's attention, showing improved zero-shot performance on downstream tasks. Although these methods achieve promising results, they inevitably alter the original information in the images, which can cause failures on specific tasks. We propose a training-free method, Foveal-Attention CLIP (FALIP), which adjusts CLIP's attention by inserting foveal attention masks into the multi-head self-attention module. We demonstrate that FALIP effectively boosts CLIP's zero-shot performance on tasks such as referring expression comprehension, image classification, and 3D point cloud recognition. Experimental results further show that FALIP outperforms existing methods on most metrics and can augment current methods to enhance their performance.
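To make the core mechanism concrete, below is a minimal PyTorch sketch of the general idea: biasing the attention logits of a ViT-style multi-head self-attention block with an additive "foveal" mask that emphasizes patch tokens inside a region of interest, without modifying the image or any weights. This is not the authors' implementation; the Gaussian falloff, the helper names `foveal_attention_mask` and `mhsa_with_foveal_bias`, and all shapes are illustrative assumptions.

```python
import torch

def foveal_attention_mask(grid_size, center, sigma, device="cpu"):
    """Gaussian bias over a grid_size x grid_size patch grid, flattened,
    with a leading zero for the [CLS] token. Purely illustrative."""
    ys, xs = torch.meshgrid(
        torch.arange(grid_size, device=device),
        torch.arange(grid_size, device=device),
        indexing="ij",
    )
    cy, cx = center
    dist2 = (ys - cy).float() ** 2 + (xs - cx).float() ** 2
    bias = torch.exp(-dist2 / (2 * sigma ** 2))  # high inside the "fovea"
    # Prepend a zero so the bias aligns with [CLS] + patch tokens.
    return torch.cat([torch.zeros(1, device=device), bias.flatten()])

def mhsa_with_foveal_bias(x, qkv_proj, out_proj, num_heads, attn_bias):
    """Standard multi-head self-attention with an additive bias on the
    attention logits; training-free, since only the logits are shifted."""
    B, N, C = x.shape
    head_dim = C // num_heads
    qkv = qkv_proj(x).reshape(B, N, 3, num_heads, head_dim).permute(2, 0, 3, 1, 4)
    q, k, v = qkv[0], qkv[1], qkv[2]                    # each: (B, heads, N, head_dim)
    logits = (q @ k.transpose(-2, -1)) / head_dim ** 0.5  # (B, heads, N, N)
    logits = logits + attn_bias.view(1, 1, 1, N)        # bias every query toward foveal keys
    attn = logits.softmax(dim=-1)
    out = (attn @ v).transpose(1, 2).reshape(B, N, C)
    return out_proj(out)

# Example: 7x7 patch grid (50 tokens with [CLS]), fovea near the top-left.
torch.manual_seed(0)
grid, C, heads = 7, 64, 4
x = torch.randn(2, 1 + grid * grid, C)
qkv = torch.nn.Linear(C, 3 * C)
proj = torch.nn.Linear(C, C)
bias = 2.0 * foveal_attention_mask(grid, center=(1, 1), sigma=1.5)
y = mhsa_with_foveal_bias(x, qkv, proj, heads, bias)
```

Because the mask only shifts attention logits inside the transformer, the pixel content of the image is left untouched, which is the contrast the abstract draws with circle- or blur-based visual prompts.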