CLIP-based prompt tuning enables pretrained Vision-Language Models (VLMs) to adapt efficiently to downstream tasks. Although existing studies have made significant progress, they pay limited attention to changes in the internal attention representations of VLMs during the tuning process. In this paper, we attribute the failure modes of prompt-tuned predictions to shifts in the foreground attention of the visual encoder, and propose Foreground View-Guided Prompt Tuning (FVG-PT), an adaptive, plug-and-play foreground attention guidance module that alleviates these shifts. Concretely, FVG-PT introduces a learnable Foreground Reliability Gate to automatically enhance foreground-view quality, applies a Foreground Distillation Compensation module to guide visual attention toward the foreground, and further introduces a Prior Calibration module to mitigate the generalization degradation caused by excessive focus on the foreground. Experiments across multiple backbone models and datasets demonstrate the effectiveness and compatibility of FVG-PT. Code is available at: https://github.com/JREion/FVG-PT