Multiscale Superpixel Structured Difference Graph Convolutional Network for VL Representation

In multimodal learning, the key to integrating vision and language lies in establishing a good alignment strategy. Recently, benefiting from the success of self-supervised learning, significant progress has been made in multimodal semantic representation based on pre-trained vision and language models. However, there is still room for improvement in visual semantic representation. Current pixel-based and patch-based methods lack spatial semantic coherence and are vulnerable to noise, which makes it difficult for them to extract the boundaries of complex scenes accurately. To this end, this paper adopts superpixels as a comprehensive and compact learnable representation of image data: clustering perceptually similar pixels reduces the number of visual primitives that subsequent processing has to handle. To mine more precise topological relations, we propose a Multiscale Difference Graph Convolutional Network (MDGCN). It parses the entire image as a fine-to-coarse hierarchy of constituent visual patterns and captures multiscale features by progressively merging adjacent superpixels, which serve as graph nodes. Moreover, we predict the differences between adjacent nodes through the graph structure, which helps each node aggregate key information and supports reasoning about the actual semantic relations. We then design a bottom-up multi-level fusion rule that learns complementary spatial information at different regional scales to avoid deviations in understanding. The proposed method can be applied to multiple downstream tasks. Extensive experiments demonstrate that our method is competitive with other state-of-the-art methods in visual reasoning. Our code will be released upon publication.
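
To make the graph construction described above more concrete, the following is a minimal NumPy sketch, not the MDGCN implementation. It assumes a superpixel label map is already available (for example from a clustering method such as SLIC) and shows three steps: building a region adjacency graph over superpixels, computing a per-node difference feature, and performing one fine-to-coarse merge. All function names are hypothetical, and both the mean-difference feature and the greedy "merge the most similar adjacent pair" rule are illustrative assumptions. The abstract does not specify how differences are predicted, how nodes are merged, or how scales are fused.

```python
import numpy as np

def region_adjacency(labels):
    """Boolean adjacency matrix of superpixels sharing a 4-connected border."""
    n = int(labels.max()) + 1
    adj = np.zeros((n, n), dtype=bool)
    # Compare every pixel with its right neighbour, then its bottom neighbour.
    for a, b in ((labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])):
        mask = a != b
        adj[a[mask], b[mask]] = True
        adj[b[mask], a[mask]] = True
    return adj

def node_features(labels, pixels):
    """Mean pixel feature per superpixel; pixels has shape (H, W, C)."""
    n = int(labels.max()) + 1
    flat = labels.ravel()
    sums = np.zeros((n, pixels.shape[-1]))
    np.add.at(sums, flat, pixels.reshape(-1, pixels.shape[-1]))
    counts = np.bincount(flat, minlength=n)[:, None]
    return sums / np.maximum(counts, 1)

def difference_features(x, adj):
    """Append the mean neighbour difference (x_j - x_i) to each node feature.

    Assumption: a fixed mean stands in for the learned difference prediction.
    """
    a = adj.astype(float)
    deg = a.sum(axis=1, keepdims=True)
    diff = (a @ x - deg * x) / np.maximum(deg, 1.0)
    return np.concatenate([x, diff], axis=1)

def merge_most_similar(labels, x, adj):
    """One coarsening step: merge the adjacent pair with the closest features.

    Assumption: a greedy rule chosen for illustration only.
    """
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    dist[~adj] = np.inf  # only adjacent superpixels may merge
    i, j = np.unravel_index(np.argmin(dist), dist.shape)
    merged = np.where(labels == j, i, labels)
    # Relabel so that node ids stay consecutive after the merge.
    _, inverse = np.unique(merged.ravel(), return_inverse=True)
    return inverse.reshape(labels.shape)

# Toy 3x4 image with three superpixels and one feature channel.
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [2, 2, 2, 2]])
pixels = np.array([1.0, 2.0, 10.0])[labels][..., None]

adj = region_adjacency(labels)
x = node_features(labels, pixels)
h = difference_features(x, adj)
coarse = merge_most_similar(labels, x, adj)
```

In this toy example, superpixels 0 and 1 have the closest features, so they are merged first and the coarser label map contains two regions. Repeating the three steps on `coarse` would give the next level of a fine-to-coarse hierarchy, whose per-level features a fusion rule could then combine.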
