Humans learn language via multi-modal knowledge. However, because they are pre-trained on text alone, most existing pre-trained language models (PLMs) lack access to multi-modal information. To inject visual knowledge into PLMs, existing methods use either the text encoder or the image encoder of a vision-language model (VLM) to encode visual information, and they update all of the PLM's original parameters for knowledge fusion. In this paper, we propose X-adapter, a new plug-and-play module that flexibly leverages the aligned visual and textual knowledge learned by pre-trained VLMs and efficiently injects it into PLMs. Specifically, we insert X-adapters into PLMs and update only the added parameters during adaptation. To fully exploit the potential of VLMs, each X-adapter consists of two sub-modules, V-expert and T-expert, which fuse the VLM's image and text representations, respectively. We can activate different sub-modules depending on the downstream task. Experimental results show that our method significantly improves performance on object-color reasoning and natural language understanding (NLU) tasks compared with PLM baselines.
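The adaptation scheme in the abstract (frozen PLM parameters, newly added adapter parameters that are the only ones updated, and two experts that can be activated independently) can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the parameter names, the scalar "gate" per expert, the additive fusion rule, and the plain SGD update are all hypothetical stand-ins, and the abstract does not specify how V-expert and T-expert actually fuse the VLM representations.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

Vector = List[float]


@dataclass
class Param:
    """Scalar stand-in for a weight tensor, with a trainable flag."""
    value: float
    trainable: bool


def build_params() -> Dict[str, Param]:
    # Hypothetical parameter names. The PLM weight is frozen; only the
    # newly added X-adapter weights are marked trainable.
    return {
        "plm.layer_scale": Param(1.0, trainable=False),
        "x_adapter.v_expert.gate": Param(0.5, trainable=True),
        "x_adapter.t_expert.gate": Param(0.5, trainable=True),
    }


def x_adapter_forward(
    params: Dict[str, Param],
    hidden: Vector,
    image_feat: Optional[Vector] = None,
    text_feat: Optional[Vector] = None,
) -> Vector:
    """Frozen PLM step, then add whichever expert outputs are active.

    Additive gated fusion is an assumption of this sketch, not the
    paper's fusion mechanism.
    """
    out = [params["plm.layer_scale"].value * h for h in hidden]
    if image_feat is not None:  # V-expert active: fuse VLM image features
        gate = params["x_adapter.v_expert.gate"].value
        out = [o + gate * f for o, f in zip(out, image_feat)]
    if text_feat is not None:  # T-expert active: fuse VLM text features
        gate = params["x_adapter.t_expert.gate"].value
        out = [o + gate * f for o, f in zip(out, text_feat)]
    return out


def sgd_step(params: Dict[str, Param], grads: Dict[str, float], lr: float) -> None:
    """Update only trainable (adapter) parameters; frozen ones are skipped."""
    for name, param in params.items():
        if param.trainable:
            param.value -= lr * grads[name]
```

Calling `x_adapter_forward` with only `image_feat` corresponds to activating V-expert alone, with only `text_feat` to T-expert alone, and with neither to the unmodified PLM path. After any number of `sgd_step` calls, `plm.layer_scale` keeps its original value, which is the property the abstract describes as updating only the added parameters.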