Latent image representations arising from vision-language models have proved immensely useful for a variety of downstream tasks. However, their utility is limited by their entanglement with respect to different visual attributes. For instance, recent work has shown that CLIP image representations are often biased toward specific visual properties (such as objects or actions) in an unpredictable manner. In this paper, we propose to separate representations of the different visual modalities in CLIP's joint vision-language space by leveraging the association between parts of speech and specific visual modes of variation (e.g. nouns relate to objects, adjectives describe appearance). This is achieved by formulating an appropriate component analysis model that learns subspaces capturing the variability corresponding to a specific part of speech, while jointly minimising the variability corresponding to the remaining parts of speech. Such a subspace yields disentangled representations of the different visual properties of an image or text in closed form while respecting the underlying geometry of the manifold on which the representations lie. Furthermore, we show that the proposed model also facilitates learning subspaces corresponding to specific visual appearances (e.g. artists' painting styles), which enables the selective removal of entire visual themes from CLIP-based text-to-image synthesis. We validate the model both qualitatively, by visualising the subspace projections with a text-to-image model and by preventing the imitation of artists' styles, and quantitatively, through class invariance metrics and improvements to baseline zero-shot classification.
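To make the idea of a closed-form subspace concrete, the following is a minimal sketch of one way such a component analysis could look. It is not the paper's exact model. It assumes a simple contrastive objective: find an orthonormal basis `W` that maximises the variance of the target part-of-speech embeddings minus a weighted variance of the remaining embeddings, which is solved by an eigendecomposition. The function names (`learn_subspace`, `project`, `remove`), the trade-off weight `lam`, and the synthetic data are all hypothetical, and the sketch works in plain Euclidean space, so it does not account for the manifold geometry mentioned above.

```python
import numpy as np


def learn_subspace(X_target, X_rest, k, lam=1.0):
    """Return a (d, k) orthonormal basis W maximising
    trace(W^T (C_target - lam * C_rest) W) subject to W^T W = I.

    Assumed objective for illustration only, not the paper's formulation.
    """
    # Second-moment matrices of the two groups of embeddings.
    C_target = X_target.T @ X_target / len(X_target)
    C_rest = X_rest.T @ X_rest / len(X_rest)
    # Symmetric matrix whose top eigenvectors solve the objective in closed form.
    M = C_target - lam * C_rest
    eigvals, eigvecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    # Keep the k eigenvectors with the largest eigenvalues, largest first.
    return eigvecs[:, -k:][:, ::-1]


def project(X, W):
    """Keep only the component of each row of X that lies in span(W)."""
    return X @ W @ W.T


def remove(X, W):
    """Remove the component of each row of X that lies in span(W)."""
    return X - X @ W @ W.T


# Synthetic stand-in for embeddings: the "target" group varies mostly along
# the first two axes, the "rest" group along the next two.
rng = np.random.default_rng(0)
scale_target = np.array([3.0, 3.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
scale_rest = np.array([0.1, 0.1, 3.0, 3.0, 0.1, 0.1, 0.1, 0.1])
X_target = rng.normal(size=(200, 8)) * scale_target
X_rest = rng.normal(size=(200, 8)) * scale_rest

W = learn_subspace(X_target, X_rest, k=2)
kept = project(X_target, W)      # target-specific variation is retained
suppressed = project(X_rest, W)  # variation from the rest is largely discarded
```

In this sketch, `project` plays the role of isolating one mode of variation, while `remove` corresponds to erasing it, as one might do to suppress a learned style subspace. With real CLIP embeddings, the two input matrices would hold embeddings grouped by part of speech.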