Vision-language models such as CLIP are useful across diverse tasks because of their shared image-text embedding space. However, the image and text embeddings are often poorly aligned, which hurts downstream performance. Recent work attributes this to an information imbalance: images contain more information than their captions describe. In this work, we propose TEVI, a framework that uses captions as a signal for what to retain from image embeddings. Specifically, we use sparse autoencoders to disentangle image embeddings and train a masking module to selectively reconstruct each embedding based on a given caption. In a controlled setup with synthetic captions, we show that TEVI preserves caption-described attributes while discarding others. Applying TEVI to CLIP models trained on natural images, we further improve retrieval performance on coarse-grained short-caption (MS COCO, Flickr) and fine-grained long-caption (IIW, DOCCI) benchmarks, with stronger gains on richer captions, and improve robustness on the RoCOCO benchmark.
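The caption-conditioned reconstruction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the dimensions, the weight matrices (`W_enc`, `W_dec`, `W_mask`), and the function names are all hypothetical, the weights are random rather than trained, and a plain ReLU encoder with a sigmoid gate stands in for whatever sparse autoencoder and masking module TEVI actually uses. A real sparse autoencoder would also need a sparsity mechanism (for example an L1 penalty during training), which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 32  # embedding dim and SAE dictionary size (illustrative values)

# Randomly initialised, untrained weights; placeholders for learned parameters.
W_enc = rng.normal(scale=0.1, size=(k, d))
b_enc = np.zeros(k)
W_dec = rng.normal(scale=0.1, size=(d, k))
W_mask = rng.normal(scale=0.1, size=(k, d))

def sae_encode(image_emb):
    """Map an image embedding to non-negative latent activations (ReLU encoder)."""
    return np.maximum(W_enc @ image_emb + b_enc, 0.0)

def caption_mask(text_emb):
    """Hypothetical masking module: one sigmoid gate in (0, 1) per SAE latent."""
    return 1.0 / (1.0 + np.exp(-(W_mask @ text_emb)))

def masked_reconstruct(image_emb, text_emb):
    """Decode only the latents that the caption-conditioned mask lets through."""
    z = sae_encode(image_emb)
    m = caption_mask(text_emb)
    return W_dec @ (m * z)

# Stand-ins for CLIP image and text embeddings of one image-caption pair.
image_emb = rng.normal(size=d)
text_emb = rng.normal(size=d)
filtered = masked_reconstruct(image_emb, text_emb)
print(filtered.shape)  # (8,)
```

In this sketch the mask is soft, so every latent contributes with a weight between 0 and 1. Whether TEVI uses soft or hard masks is not stated in the abstract.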