In existing studies of visually grounded language learning, language models are supervised with both a language-only objective and a visual grounding objective. However, because visually grounded datasets and language corpora differ in distribution and scale, the language model tends to confuse the contexts of tokens that occur in the grounded data with those of tokens that do not. As a result, during representation learning there is a mismatch between the visual information and the contextual meaning of the sentence. To overcome this limitation, we propose GroundedBERT, a grounded language learning method that enhances the BERT representation with visually grounded information. GroundedBERT comprises two components: (i) the original BERT, which captures the contextual representation of words learned from the language corpora, and (ii) a visual grounding module, which captures visual information learned from visually grounded datasets. Moreover, we employ Optimal Transport (OT), specifically its partial variant, to solve the fractional alignment problem between the two modalities. Our proposed method significantly outperforms the baseline language models on various language tasks of the GLUE and SQuAD datasets.
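To make the notion of fractional alignment concrete, the following is a minimal sketch of partial OT between a few word tokens and image regions. It is an illustration under stated assumptions, not the solver used in GroundedBERT, which the abstract does not specify: it assumes entropic regularization solved with Sinkhorn iterations and the standard dummy-point construction that reduces partial OT to a balanced problem. The function names `sinkhorn` and `partial_ot`, the toy cost matrix, and the transported mass of 0.6 are all hypothetical.

```python
import math

def sinkhorn(a, b, cost, eps=0.1, iters=500):
    """Entropic balanced OT: returns a plan with row sums ~a and column sums b."""
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternate scaling of rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def partial_ot(a, b, cost, mass, eps=0.1, iters=500):
    """Partial OT via a dummy row and column that absorb the untransported mass.

    Assumes 0 < mass < min(sum(a), sum(b)) so both dummy marginals are positive.
    """
    m = len(b)
    # Large cost on the dummy-to-dummy cell so it carries (almost) no mass.
    big = 2.0 * max(max(row) for row in cost) + 1.0
    a_ext = list(a) + [sum(b) - mass]
    b_ext = list(b) + [sum(a) - mass]
    cost_ext = [list(row) + [0.0] for row in cost] + [[0.0] * m + [big]]
    plan = sinkhorn(a_ext, b_ext, cost_ext, eps, iters)
    # Drop the dummy row and column to recover the partial plan.
    return [row[:m] for row in plan[:len(a)]]

# Toy example (hypothetical costs): three tokens, two image regions.
# Tokens 0 and 1 each match one region; token 2 has no visual counterpart.
cost = [[0.1, 0.9],
        [0.9, 0.1],
        [1.0, 1.0]]
a = [1 / 3, 1 / 3, 1 / 3]   # uniform mass over tokens
b = [0.5, 0.5]              # uniform mass over regions
plan = partial_ot(a, b, cost, mass=0.6)
for row in plan:
    print([round(x, 4) for x in row])
```

In this toy setting only part of the total mass is transported, so the token without a visual counterpart can remain almost unaligned instead of being forced onto a region, which is the behaviour a partial (rather than balanced) formulation is meant to allow.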