Vision-Language Models (VLMs) such as CLIP struggle to understand negation, often embedding affirmative and negated descriptions similarly (e.g., matching "no dog" with images of dogs). Existing methods improve negation understanding by fine-tuning CLIP's text encoder, which risks overfitting. In this work, we propose CLIPGlasses, a plug-and-play framework that enhances CLIP's ability to comprehend negated visual descriptions. CLIPGlasses adopts a dual-stage design: a Lens module disentangles negated semantics from text embeddings, and a Frame module predicts a context-aware repulsion strength, which is integrated into a modified similarity computation that penalizes alignment with the negated semantics, thereby reducing false-positive matches. Experiments show that CLIP equipped with CLIPGlasses achieves competitive in-domain performance and outperforms state-of-the-art methods in cross-domain generalization. Its advantage is especially evident under low-resource conditions, indicating stronger robustness across domains.
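The abstract describes the scoring mechanism only at a high level. As a minimal sketch of the general idea (not the paper's actual formulation), the modified similarity can be viewed as rewarding alignment with the retained affirmative semantics while subtracting a repulsion-weighted alignment with the disentangled negated semantics; the function names, the linear combination rule, and the fixed `repulsion` scalar below are all illustrative assumptions.

```python
import math


def cosine(u, v):
    # Standard cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def negation_aware_score(img_emb, pos_emb, neg_emb, repulsion):
    """Hedged sketch of a negation-aware similarity.

    pos_emb: embedding of the affirmative (retained) semantics,
             e.g. what the Lens module keeps from "a street with no dog".
    neg_emb: embedding of the disentangled negated semantics (e.g. "dog").
    repulsion: context-aware penalty weight, which in CLIPGlasses would be
               predicted by the Frame module; here it is a fixed scalar.
    The subtraction rule is an assumption about the combination step.
    """
    return cosine(img_emb, pos_emb) - repulsion * cosine(img_emb, neg_emb)
```

With this scoring, an image that aligns with the negated concept (e.g., a dog photo queried with "no dog") receives a reduced score, while images matching only the affirmative semantics are unaffected, which is the false-positive reduction the abstract describes.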