Multi-label image classification is a prediction task that aims to assign multiple labels to a given image. This paper considers the semantic consistency of the latent space between the visual patch and linguistic label domains and introduces conditional transport (CT) theory to bridge the gap between them. While recent cross-modal attention-based studies have attempted to align these two representations and have achieved impressive performance, they require carefully designed alignment modules and extra, complex operations in the attention computation. We find that by formulating multi-label classification as a CT problem, we can efficiently exploit the interactions between image and labels by minimizing the bidirectional CT cost. Specifically, after feeding the images and textual labels into modality-specific encoders, we view each image both as a mixture of patch embeddings and as a mixture of label embeddings, which capture the local region features and the class prototypes, respectively. CT is then employed to learn and align these two semantic sets by defining forward and backward navigators. Importantly, the navigators defined in the CT distance model the similarities between patches and labels, which provides an interpretable tool for visualizing the learned prototypes. Extensive experiments on three public image benchmarks show that the proposed model consistently outperforms previous methods. Our code is available at https://github.com/keepgoingjkg/PatchCT.
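The bidirectional CT cost described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `bidirectional_ct_cost`, the cosine-based cost, the softmax navigators with a `temperature` parameter, and the uniform weights over patches and labels are all assumptions made for illustration. The forward navigator distributes each patch over the labels, and the backward navigator distributes each label over the patches.

```python
import numpy as np


def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def bidirectional_ct_cost(patches, labels, temperature=1.0):
    """Illustrative bidirectional CT cost between two embedding sets.

    patches: (N, d) array of patch embeddings.
    labels:  (M, d) array of label embeddings.
    Assumptions (not taken from the paper): cosine cost, softmax
    navigators, uniform weights over patches and over labels.
    """
    # L2-normalise so that dot products are cosine similarities.
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    q = labels / np.linalg.norm(labels, axis=1, keepdims=True)

    sim = p @ q.T          # (N, M) patch-label similarities
    cost = 1.0 - sim       # (N, M) point-to-point transport cost

    # Forward navigator: each patch (row) spreads its mass over labels.
    forward_nav = softmax(sim / temperature, axis=1)
    # Backward navigator: each label (column) spreads its mass over patches.
    backward_nav = softmax(sim / temperature, axis=0)

    # Expected cost in each direction under uniform source weights.
    forward_cost = (forward_nav * cost).sum(axis=1).mean()
    backward_cost = (backward_nav * cost).sum(axis=0).mean()
    return forward_cost + backward_cost, forward_nav, backward_nav


# Toy usage with random embeddings: 6 patches, 3 labels, dimension 4.
rng = np.random.default_rng(0)
patches = rng.normal(size=(6, 4))
labels = rng.normal(size=(3, 4))
total, fwd, bwd = bidirectional_ct_cost(patches, labels)
```

In this sketch the returned `fwd` and `bwd` matrices play the role of the navigators: inspecting a column of `bwd` shows which patches a given label attends to, which is the kind of patch-label similarity the abstract describes as interpretable.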