Multiscale Superpixel Structured Difference Graph Convolutional Network for VL Representation

Within the multimodal field, the key to integrating vision and language lies in establishing a good alignment strategy. Recently, benefiting from the success of self-supervised learning, significant progress has been made in multimodal semantic representation based on pre-trained models for vision and language. However, there is still room for improvement in visual semantic representation. The lack of spatial semantic coherence and vulnerability to noise makes it challenging for current pixel or patch-based methods to accurately extract complex scene boundaries. To this end, this paper develops superpixel as a comprehensive compact representation of learnable image data, which effectively reduces the number of visual primitives for subsequent processing by clustering perceptually similar pixels. To mine more precise topological relations, we propose a Multiscale Difference Graph Convolutional Network (MDGCN). It parses the entire image as a fine-to-coarse hierarchical structure of constituent visual patterns, and captures multiscale features by progressively merging adjacent superpixels as graph nodes. Moreover, we predict the differences between adjacent nodes through the graph structure, facilitating key information aggregation of graph nodes to reason actual semantic relations. Afterward, we design a multi-level fusion rule in a bottom-up manner to avoid understanding deviation by learning complementary spatial information at different regional scales. Our proposed method can be well applied to multiple downstream task learning. Extensive experiments demonstrate that our method is competitive with other state-of-the-art methods in visual reasoning. Our code will be released upon publication.

翻译：在多模态领域中，视觉与语言融合的关键在于建立良好的对齐策略。近年来，得益于自监督学习的成功，基于视觉与语言预训练模型的多模态语义表示取得了显著进展。然而，视觉语义表示仍存在改进空间。由于缺乏空间语义连贯性且易受噪声干扰，当前基于像素或图像块的方法难以准确提取复杂场景边界。为此，本文提出将超像素作为可学习图像数据的紧凑综合表示，通过聚类感知相似的像素，有效减少后续处理的视觉基元数量。为挖掘更精确的拓扑关系，我们提出多尺度差异图卷积网络（MDGCN）。该网络将整幅图像解析为由细到粗的视觉模式层级结构，并通过逐步合并相邻超像素作为图节点来捕获多尺度特征。此外，我们通过图结构预测相邻节点间的差异，促进图节点关键信息聚合以推理实际语义关系。随后，我们设计了一种自底向上的多层级融合规则，通过在不同区域尺度上学习互补的空间信息，避免理解偏差。所提方法可良好应用于多项下游任务学习。大量实验表明，本方法在视觉推理中与其他最先进方法具有竞争力。相关代码将在论文发表后公开。