Multi-label recognition (MLR) involves identifying multiple objects within an image. To address the additional complexity of this problem, recent works have leveraged information from vision-language models (VLMs) trained on large text-image datasets. These methods learn an independent classifier for each object (class), overlooking correlations among class occurrences. Such co-occurrences can be captured from the training data as conditional probabilities between pairs of classes. We propose a framework that extends the independent classifiers with co-occurrence information for object pairs to improve their performance. We use a Graph Convolutional Network (GCN) to enforce the conditional probabilities between classes by refining the initial estimates obtained from image and text sources using VLMs. We validate our method on four MLR datasets, where our approach outperforms all state-of-the-art methods.
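The two ingredients named above, pairwise conditional probabilities estimated from training labels and a graph-convolution step that refines initial per-class estimates, can be illustrated with a short sketch. This is not the authors' implementation: the abstract does not specify the architecture, so the code assumes a binary image-by-class label matrix, a single propagation step with a row-normalized adjacency matrix, and a two-column input holding one image-derived and one text-derived score per class. The function names `conditional_cooccurrence` and `gcn_refine`, and all numbers, are hypothetical.

```python
import numpy as np


def conditional_cooccurrence(labels: np.ndarray) -> np.ndarray:
    """Estimate P[i, j] = P(class j present | class i present).

    `labels` is a binary matrix of shape (num_images, num_classes).
    """
    labels = labels.astype(np.float64)
    # counts[i, j] = number of images that contain both class i and class j.
    counts = labels.T @ labels
    # The diagonal holds the number of images that contain each class.
    occurrences = np.diag(counts).copy()
    # Guard against division by zero for classes that never appear.
    safe = np.where(occurrences > 0, occurrences, 1.0)
    return counts / safe[:, None]


def gcn_refine(scores: np.ndarray, adjacency: np.ndarray,
               weight: np.ndarray) -> np.ndarray:
    """One graph-convolution step: ReLU(A_norm @ scores @ weight).

    Assumption: the adjacency matrix is row-normalized so that each class
    aggregates a weighted average of its neighbours' scores.
    """
    row_sums = adjacency.sum(axis=1, keepdims=True)
    a_norm = adjacency / np.where(row_sums > 0, row_sums, 1.0)
    return np.maximum(a_norm @ scores @ weight, 0.0)


# Toy training labels: 4 images, 3 classes (made-up data).
labels = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
])
cond_prob = conditional_cooccurrence(labels)

# Hypothetical initial estimates per class: [image-derived, text-derived].
initial = np.array([
    [0.9, 0.8],
    [0.4, 0.5],
    [0.1, 0.2],
])
# An identity weight keeps the example easy to follow; a trained model
# would learn this matrix.
refined = gcn_refine(initial, cond_prob, np.eye(2))
print(cond_prob)
print(refined)
```

In this toy data, class 2 only ever appears together with class 1, so the estimated probability of class 1 given class 2 is 1, while the reverse direction is lower. The matrix is therefore asymmetric, which is why the sketch treats it as a directed adjacency matrix.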