We propose $\mathbb{VD}$-$\mathbb{GR}$, a novel visual dialog model that combines pre-trained language models (LMs) with graph neural networks (GNNs). Prior work has mainly focused on one class of models at the expense of the other, thereby missing the opportunity to combine their respective benefits. At the core of $\mathbb{VD}$-$\mathbb{GR}$ is a novel integration mechanism that alternates between spatial-temporal multi-modal GNNs and BERT layers and that comprises three distinct contributions. First, we use multi-modal GNNs to process the features of each modality (image, question, and dialog history) and exploit their local structures before performing BERT global attention. Second, we propose hub-nodes that link to all other nodes within one modality graph, allowing the model to propagate information from one GNN (modality) to the next in a cascaded manner. Third, we augment the BERT hidden states with fine-grained multi-modal GNN features before passing them to the next $\mathbb{VD}$-$\mathbb{GR}$ layer. Evaluations on VisDial v1.0, VisDial v0.9, VisDialConv, and VisPro show that $\mathbb{VD}$-$\mathbb{GR}$ achieves new state-of-the-art results on all four datasets.
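To make the hub-node idea more concrete, the following is a minimal NumPy sketch of how a hub node linked to every node of one modality graph could carry information to the next modality graph in a cascade. It is an illustration under stated assumptions, not the authors' implementation: the function names (`add_hub`, `mean_aggregate`, `cascade`), the mean-neighbour aggregation rule, and the zero-initialised hub feature are all hypothetical choices made here, and the actual model uses learned spatial-temporal GNN layers interleaved with BERT layers.

```python
import numpy as np


def add_hub(adj):
    """Return a copy of `adj` with one extra hub node linked to every other node."""
    n = adj.shape[0]
    out = np.zeros((n + 1, n + 1), dtype=adj.dtype)
    out[:n, :n] = adj   # keep the original local structure
    out[n, :n] = 1      # hub -> all nodes
    out[:n, n] = 1      # all nodes -> hub
    return out


def mean_aggregate(adj, feats):
    """One round of mean-neighbour message passing with self-loops.

    Assumption: a plain mean stands in for the learned GNN layer.
    """
    a = adj + np.eye(adj.shape[0], dtype=adj.dtype)
    deg = a.sum(axis=1, keepdims=True)
    return (a @ feats) / deg


def cascade(graphs, feats, hub_init):
    """Process the modality graphs in order, handing the hub feature from one to the next."""
    hub = hub_init
    outputs = []
    for adj, x in zip(graphs, feats):
        x_with_hub = np.vstack([x, hub[None, :]])        # hub is the last row
        y = mean_aggregate(add_hub(adj), x_with_hub)
        outputs.append(y[:-1])                           # updated modality nodes
        hub = y[-1]                                      # summary passed to the next modality
    return outputs, hub
```

In this sketch the hub feature computed on the first graph becomes an input node of the second graph, so each node of the second modality receives a summary of the first modality after a single message-passing round. The order in which the modalities are chained is left to the caller.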