The crux of learning vision-language models is to extract semantically aligned information from visual and linguistic data. Existing approaches often suffer from coarse alignment, e.g., the vision encoder struggles to localize an attribute-specified object. In this work, we propose an embarrassingly simple approach that better aligns image and text features while requiring no data formats beyond image-text pairs. Concretely, given an image and its paired text, we parse objects (e.g., cat) and attributes (e.g., black) from the description, which are highly likely to exist in the image. Notably, the parsing pipeline is fully automatic and thus scales well. With these parsed semantics as supervision signals, we complement the commonly used image-text contrastive loss with a multi-tag classification loss. Extensive experiments on a broad suite of semantic segmentation datasets demonstrate an average 5.2\% improvement of our framework over existing alternatives. Furthermore, visualizations indicate that attribute supervision enables vision-language models to accurately localize attribute-specified objects. The project page can be found at https://qinying-liu.github.io/Tag-Align.
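To make the tag-supervision idea concrete, the sketch below illustrates the two ingredients the abstract names: parsing object and attribute tags from a caption, and scoring image-side tag predictions with a multi-label classification loss. This is a minimal toy illustration, not the paper's implementation: the tiny tag vocabulary, the word-matching parser, and the hand-set logits are all hypothetical stand-ins (the paper uses an automatic parsing pipeline and a trained vision encoder).

```python
import math

# Hypothetical tag vocabulary for illustration; the paper parses objects
# and attributes automatically from large-scale captions.
OBJECT_TAGS = ["cat", "dog", "car"]
ATTRIBUTE_TAGS = ["black", "white", "red"]
TAGS = OBJECT_TAGS + ATTRIBUTE_TAGS

def parse_tags(caption):
    """Toy parser: a multi-hot target marking which known tags the caption mentions."""
    words = set(caption.lower().replace(".", "").split())
    return [1.0 if tag in words else 0.0 for tag in TAGS]

def multi_tag_bce(logits, targets):
    """Multi-tag classification loss: mean binary cross-entropy over tag logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid probability for this tag
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(logits)

# Text side: parse supervision targets from the paired caption.
targets = parse_tags("A black cat on the sofa.")
# Image side: tag logits would come from the vision encoder; here they are
# hand-set so that present tags (cat, black) score high and absent ones low.
logits = [2.0, -2.0, -2.0, 2.0, -2.0, -2.0]
loss = multi_tag_bce(logits, targets)
```

In training, this loss would be added to the image-text contrastive loss, so the model is pushed to recognize each parsed object and attribute individually rather than only matching whole captions to whole images.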