The crux of learning vision-language models is to extract semantically aligned information from visual and linguistic data. Existing approaches usually suffer from coarse alignment, \textit{e.g.}, the vision encoder struggles to localize an attribute-specified object. In this work, we propose an embarrassingly simple approach that better aligns image and text features and requires no data formats other than image-text pairs. Concretely, given an image and its paired text, we parse from the description the objects (\textit{e.g.}, cat) and attributes (\textit{e.g.}, black) that are highly likely to exist in the image. Notably, the parsing pipeline is fully automatic and therefore scales well. Using these parsed semantics as supervision signals, we complement the commonly used image-text contrastive loss with a multi-tag classification loss. Extensive experiments on a broad suite of semantic segmentation datasets show that our framework improves over existing alternatives by 3.65\% on average. Furthermore, visualization results indicate that attribute supervision enables vision-language models to accurately localize attribute-specified objects. The project page is available at https://qinying-liu.github.io/Tag-Align/
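As a rough illustration of how parsed tags could complement the contrastive objective, the following is a minimal sketch and not the authors' implementation. It makes several assumptions that do not come from the abstract: a small fixed vocabulary (`TAG_VOCAB`) with plain word matching stands in for the automatic parsing pipeline, the multi-tag classification loss is taken to be a binary cross-entropy over tags, and the names `parse_tags`, `tag_weight`, and `temperature` are hypothetical.

```python
import numpy as np

# Hypothetical tag vocabulary (objects and attributes); an assumption for this sketch.
TAG_VOCAB = ["cat", "dog", "sofa", "black", "white"]


def parse_tags(caption, vocab=TAG_VOCAB):
    """Return a multi-hot vector marking the vocabulary tags that occur in the caption."""
    cleaned = caption.lower().replace(",", " ").replace(".", " ")
    words = set(cleaned.split())
    return np.array([1.0 if tag in words else 0.0 for tag in vocab])


def contrastive_loss(img, txt, temperature=0.07):
    """Symmetric image-text contrastive (InfoNCE) loss over a batch of embeddings."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    def cross_entropy_on_diagonal(scores):
        # Matching pairs sit on the diagonal; apply a stable log-softmax per row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits.T))


def multi_tag_loss(tag_logits, targets):
    """Binary cross-entropy with logits, averaged over all images and tags."""
    x, y = tag_logits, targets
    # Numerically stable form: max(x, 0) - x * y + log(1 + exp(-|x|)).
    return np.mean(np.maximum(x, 0.0) - x * y + np.log1p(np.exp(-np.abs(x))))


def total_loss(img, txt, tag_logits, targets, tag_weight=1.0):
    """Contrastive loss plus a weighted multi-tag classification loss."""
    return contrastive_loss(img, txt) + tag_weight * multi_tag_loss(tag_logits, targets)


# Toy usage with random embeddings and tag logits in place of real encoder outputs.
captions = ["A black cat on the sofa.", "A white dog."]
targets = np.stack([parse_tags(c) for c in captions])
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(2, 8))
text_emb = rng.normal(size=(2, 8))
tag_logits = rng.normal(size=(2, len(TAG_VOCAB)))
loss = total_loss(image_emb, text_emb, tag_logits, targets)
```

In this sketch the tag targets come only from the caption, so no annotation beyond image-text pairs is needed; `tag_weight` simply balances the two terms and would have to be tuned.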