We propose the Vision-and-Augmented-Language Transformer (VAuLT). VAuLT extends the popular Vision-and-Language Transformer (ViLT) and improves performance on vision-and-language (VL) tasks whose text inputs are more complex than image captions, with minimal impact on training and inference efficiency. ViLT enables efficient training and inference in VL tasks by encoding images with a linear projection of patches instead of an object detector. However, it is pretrained on captioning datasets, where the language input is simple, literal, and descriptive, and therefore lacks linguistic diversity. Consequently, when working with multimedia data in the wild, such as multimodal social media data, there is a notable shift away from captioning language, as well as a greater diversity of tasks. We indeed find evidence that the language capacity of ViLT is lacking. The key insight and novelty of VAuLT is to propagate the output representations of a large language model (LM) like BERT to the language input of ViLT. We show that joint training of the LM and ViLT can yield relative improvements of up to 20% over ViLT and achieve state-of-the-art or comparable performance on VL tasks involving richer language inputs and affective constructs, such as Target-Oriented Sentiment Classification on TWITTER-2015 and TWITTER-2017, and Sentiment Classification on MVSA-Single and MVSA-Multiple. Our code is available at https://github.com/gchochla/VAuLT.
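The core idea, feeding an LM's output representations into the language input of a ViLT-style encoder and training both jointly, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation (see the repository for that): `ToyVAuLT`, all layer sizes, and the small randomly initialized transformers standing in for BERT and ViLT are assumptions made purely for illustration, and positional embeddings are omitted for brevity.

```python
import torch
from torch import nn


class ToyVAuLT(nn.Module):
    """Hypothetical toy model: LM outputs replace the VL encoder's word embeddings."""

    def __init__(self, vocab_size=1000, dim=64, patch_size=16, num_classes=3):
        super().__init__()
        # Stand-in for a pretrained LM such as BERT (assumption: tiny, untrained).
        self.lm_embed = nn.Embedding(vocab_size, dim)
        self.lm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, dim_feedforward=128, batch_first=True),
            num_layers=2,
        )
        # ViLT-style image path: linear projection of flattened patches, no object detector.
        self.patch_size = patch_size
        self.patch_proj = nn.Linear(3 * patch_size * patch_size, dim)
        # Modality-type embeddings: index 0 = text, index 1 = image.
        self.type_embed = nn.Embedding(2, dim)
        # Stand-in for the ViLT joint transformer (assumption: tiny, untrained).
        self.vl = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, dim_feedforward=128, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(dim, num_classes)

    def forward(self, input_ids, pixel_values):
        # 1) The LM contextualizes the text: (batch, tokens, dim).
        text = self.lm(self.lm_embed(input_ids))
        # 2) Cut the image into non-overlapping patches and project them linearly.
        p = self.patch_size
        b, c, _, _ = pixel_values.shape
        patches = pixel_values.unfold(2, p, p).unfold(3, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        image = self.patch_proj(patches)
        # 3) LM outputs take the place of word embeddings in the joint encoder.
        text = text + self.type_embed.weight[0]
        image = image + self.type_embed.weight[1]
        joint = self.vl(torch.cat([text, image], dim=1))
        # 4) Classify from the first text position.
        return self.head(joint[:, 0])


torch.manual_seed(0)
model = ToyVAuLT()
input_ids = torch.randint(0, 1000, (2, 12))   # two toy "tweets" of 12 token ids
pixel_values = torch.rand(2, 3, 32, 32)       # two toy RGB images (4 patches each)
labels = torch.tensor([0, 2])

# Joint training: one loss, gradients flow through the VL encoder into the LM.
logits = model(input_ids, pixel_values)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
```

Because the LM sits inside the same computation graph, a single optimizer over `model.parameters()` would update both components together, which is the sense of "joint training" assumed in this sketch.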