Compositional understanding allows vision-language models to interpret the complex relationships among objects, attributes, and relations in images and text. However, most existing methods rely on hard negative examples and fine-tuning, which can inflate reported improvements and are constrained by the difficulty of obtaining hard negatives. In this work, we introduce Zero-Shot Compositional Understanding (ZS-CU), a novel task that enhances compositional understanding without requiring hard negative training data. We propose YUKINO (Yielded Compositional Understanding Knowledge via Textual Inversion with NO), which uses textual inversion to map unlabeled images to pseudo-tokens in a pre-trained CLIP model. We introduce "no" logical regularization to address the problem of token interaction during inversion, and we additionally employ knowledge distillation to reduce the time complexity of textual inversion. Experimental results show that YUKINO outperforms existing multi-modal SOTA models by over 8% on the SugarCREPE benchmark and also achieves significant improvements in image retrieval tasks.
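At the heart of the approach is textual inversion: a learnable pseudo-token is optimized so that a text prompt containing it matches the image in CLIP's joint embedding space. Below is a minimal sketch of this core step, assuming Hugging Face transformers and the openai/clip-vit-base-patch32 checkpoint; the placeholder token "<s*>", the prompt template, the image path, the learning rate, and the step count are illustrative assumptions, and the paper's "no" logical regularization and distillation stages are omitted.

```python
# Minimal textual-inversion sketch: optimize one pseudo-token embedding
# so that "a photo of <s*>" matches a single unlabeled image in CLIP space.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Register a placeholder token and grow the text embedding table.
processor.tokenizer.add_tokens(["<s*>"])
model.resize_token_embeddings(len(processor.tokenizer))
new_id = processor.tokenizer.convert_tokens_to_ids("<s*>")

# Freeze CLIP; only the token embedding table stays trainable, and the
# gradient is masked below so that just the new row is ever updated.
for p in model.parameters():
    p.requires_grad_(False)
embeddings = model.get_input_embeddings()
embeddings.weight.requires_grad_(True)

# Target image feature (hypothetical image path), computed once.
image = Image.open("example.jpg")
with torch.no_grad():
    pixel = processor(images=image, return_tensors="pt").to(device)
    img_feat = model.get_image_features(**pixel)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

text = processor.tokenizer(["a photo of <s*>"], return_tensors="pt").to(device)
# weight_decay=0.0 so frozen embedding rows are not decayed by AdamW.
optimizer = torch.optim.AdamW([embeddings.weight], lr=5e-3, weight_decay=0.0)

for step in range(200):
    txt_feat = model.get_text_features(**text)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    loss = 1.0 - (img_feat * txt_feat).sum(dim=-1).mean()  # cosine distance
    optimizer.zero_grad()
    loss.backward()
    # Zero the gradient of every embedding row except the pseudo-token's.
    mask = torch.zeros_like(embeddings.weight.grad)
    mask[new_id] = 1.0
    embeddings.weight.grad.mul_(mask)
    optimizer.step()
```

After optimization, the pseudo-token "<s*>" can be composed into ordinary text (e.g., negated or attribute-modified prompts) and scored against images with the frozen CLIP encoders, which is what makes the compositional probing zero-shot: no hard negatives and no fine-tuning of CLIP's weights are involved in this step.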