Multi-label image classification is a prediction task that aims to assign more than one label to a given image. This paper considers semantic consistency in the latent space between the visual patch domain and the linguistic label domain, and introduces conditional transport (CT) theory to bridge the gap between them. While recent cross-modal attention-based studies have attempted to align these two representations and have achieved impressive performance, they require carefully designed alignment modules and extra, complex operations in the attention computation. We find that by formulating multi-label classification as a CT problem, we can efficiently exploit the interactions between image and label by minimizing the bidirectional CT cost. Specifically, after feeding the images and textual labels into modality-specific encoders, we view each image as a mixture of patch embeddings and a mixture of label embeddings, which capture the local region features and the class prototypes, respectively. CT is then employed to learn and align these two semantic sets by defining forward and backward navigators. Importantly, the navigators defined in the CT distance model the similarities between patches and labels, which provides an interpretable tool for visualizing the learned prototypes. Extensive experiments on three public image benchmarks show that the proposed model consistently outperforms previous methods.
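To make the bidirectional CT cost concrete, the following is a minimal sketch rather than the implementation used in this paper. It assumes uniform weights over patches and labels, a cosine-distance point-to-point cost, and navigators given by a softmax over dot-product similarities; none of these choices is stated in the abstract. The names `bidirectional_ct_cost`, `cosine_cost`, and the toy embeddings are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_cost(u, v):
    # Assumed transport cost: 1 - cosine similarity (embeddings must be nonzero).
    return 1.0 - dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def bidirectional_ct_cost(patches, labels):
    """Return (total_cost, forward_nav, backward_nav).

    Both navigators are n_patches x n_labels matrices:
    forward_nav[i][j]  = probability of moving from patch i to label j,
    backward_nav[i][j] = probability of moving from label j to patch i.
    """
    n, m = len(patches), len(labels)
    sim = [[dot(x, y) for y in labels] for x in patches]
    cost = [[cosine_cost(x, y) for y in labels] for x in patches]

    # Forward navigator: each patch spreads its mass over labels (rows sum to 1).
    forward_nav = [softmax(row) for row in sim]

    # Backward navigator: each label spreads its mass over patches (columns sum to 1).
    cols = [softmax([sim[i][j] for i in range(n)]) for j in range(m)]
    backward_nav = [[cols[j][i] for j in range(m)] for i in range(n)]

    # Expected cost in each direction, averaged with uniform weights.
    forward = sum(forward_nav[i][j] * cost[i][j]
                  for i in range(n) for j in range(m)) / n
    backward = sum(backward_nav[i][j] * cost[i][j]
                   for i in range(n) for j in range(m)) / m
    return forward + backward, forward_nav, backward_nav


# Toy example: three 2-D patch embeddings and two label embeddings.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
labels = [[1.0, 0.0], [0.0, 1.0]]
total, fwd, bwd = bidirectional_ct_cost(patches, labels)
print("bidirectional CT cost:", total)
print("forward navigator (patch -> label):", fwd)
```

Under these assumptions, the navigator matrices double as patch-to-label similarity maps, which is the quantity one would visualize to inspect the learned prototypes. In training, the total cost would be minimized with respect to the encoder parameters using an automatic differentiation framework.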