Multi-Label Image Recognition (MLIR) is a challenging task that aims to predict multiple object labels in a single image while modeling the complex relationships between labels and image regions. Although convolutional neural networks and vision transformers have succeeded in processing images as regular grids of pixels or patches, these representations are sub-optimal for capturing irregular and discontinuous regions of interest. In this work, we present the first fully graph convolutional model, Group K-nearest neighbor based Graph convolutional Network (GKGNet), which models the connections between semantic label embeddings and image patches in a flexible and unified graph structure. To address the scale variance of different objects and to capture information from multiple perspectives, we propose the Group KGCN module for dynamic graph construction and message passing. Our experiments demonstrate that GKGNet achieves state-of-the-art performance with significantly lower computational costs on challenging multi-label datasets, i.e., MS-COCO and VOC2007. We will release the code and models to facilitate future research in this area.
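To make the idea of group-wise KNN graph construction concrete, the following is a minimal sketch and not the released GKGNet implementation. It assumes that "groups" are formed by splitting the feature channels into equal slices, that neighbors are chosen by Euclidean distance within each slice, and that messages are combined by a plain mean; none of these details is stated in the abstract. The function names `group_knn` and `aggregate` are hypothetical.

```python
import math


def group_knn(queries, keys, num_groups, k):
    """For each channel group, link every query node to its k nearest key nodes.

    queries, keys: lists of equal-length feature vectors (lists of floats).
    Returns neighbors[g][q] = indices of the k keys closest to query q in group g.
    Assumption: groups are equal, contiguous slices of the feature channels.
    """
    dim = len(queries[0])
    if dim % num_groups != 0:
        raise ValueError("feature dimension must be divisible by num_groups")
    width = dim // num_groups
    neighbors = []
    for g in range(num_groups):
        lo, hi = g * width, (g + 1) * width
        per_query = []
        for q in queries:
            # Euclidean distance restricted to this group's channels.
            dists = [math.dist(q[lo:hi], x[lo:hi]) for x in keys]
            order = sorted(range(len(keys)), key=lambda j: dists[j])
            per_query.append(order[:k])
        neighbors.append(per_query)
    return neighbors


def aggregate(queries, keys, neighbors):
    """Mean-aggregate neighbor features per group, then concatenate the groups.

    Assumption: a plain mean stands in for whatever aggregation the model uses.
    """
    num_groups = len(neighbors)
    width = len(queries[0]) // num_groups
    out = []
    for qi in range(len(queries)):
        vec = []
        for g in range(num_groups):
            lo, hi = g * width, (g + 1) * width
            picked = [keys[j][lo:hi] for j in neighbors[g][qi]]
            vec.extend(sum(col) / len(picked) for col in zip(*picked))
        out.append(vec)
    return out


# Toy example: one label embedding (query) and four image patches (keys).
labels = [[0.0, 0.0, 1.0, 1.0]]
patches = [
    [0.0, 0.0, 9.0, 9.0],
    [5.0, 5.0, 1.0, 1.0],
    [0.1, 0.1, 8.0, 8.0],
    [4.0, 4.0, 1.2, 1.2],
]
edges = group_knn(labels, patches, num_groups=2, k=2)
print(edges)  # [[[0, 2]], [[1, 3]]]
```

In this toy case the two groups connect the same label node to different patches (patches 0 and 2 in the first group, patches 1 and 3 in the second), which illustrates how a single label could gather information from several, possibly non-adjacent, regions. Because the neighbors are recomputed from the current features, the graph is dynamic rather than fixed by the pixel grid.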