Humans learn language from multi-modal knowledge. However, because they are pre-trained on text alone, most existing pre-trained language models (PLMs) lack access to multi-modal information. To inject visual knowledge into PLMs, existing methods use either the text encoder or the image encoder of a vision-language model (VLM) to encode the visual information, and they update all of the PLM's original parameters for knowledge fusion. In this paper, we propose a new plug-and-play module, X-adapter, that flexibly leverages the aligned visual and textual knowledge learned in pre-trained VLMs and efficiently injects it into PLMs. Specifically, we insert X-adapters into PLMs and update only the added parameters during adaptation. To fully exploit the potential of VLMs, each X-adapter consists of two sub-modules, V-expert and T-expert, which fuse the VLM's image and text representations, respectively. Different sub-modules can be activated depending on the downstream task. Experimental results show that our method significantly improves performance on object-color reasoning and natural language understanding (NLU) tasks compared with PLM baselines.
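The adapter structure described above can be illustrated with a minimal sketch. The abstract does not specify how the experts fuse VLM representations, so the single-head cross-attention with a residual connection used here is an assumption, and all names (`Expert`, `XAdapterSketch`, `image_feats`, `text_feats`) are hypothetical. The frozen PLM itself is not modeled; the sketch only shows two optional sub-modules whose parameters are the only ones that would be trained.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class Expert:
    """Hypothetical fusion sub-module (assumption: single-head
    cross-attention from PLM hidden states to VLM features)."""

    def __init__(self, d_model, d_vlm, rng):
        def init(rows, cols):
            return [[rng.gauss(0, 0.02) for _ in range(cols)] for _ in range(rows)]

        self.Wq = init(d_model, d_model)  # projects PLM hidden states to queries
        self.Wk = init(d_model, d_vlm)    # projects VLM features to keys
        self.Wv = init(d_model, d_vlm)    # projects VLM features to values

    def __call__(self, hidden, feats):
        keys = [matvec(self.Wk, f) for f in feats]
        values = [matvec(self.Wv, f) for f in feats]
        out = []
        for h in hidden:
            q = matvec(self.Wq, h)
            scale = math.sqrt(len(q))
            weights = softmax(
                [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
            )
            out.append(
                [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(q))]
            )
        return out

    def num_params(self):
        return sum(len(W) * len(W[0]) for W in (self.Wq, self.Wk, self.Wv))


class XAdapterSketch:
    """Illustrative adapter with a V-expert and a T-expert.
    An expert is active only when its VLM features are supplied."""

    def __init__(self, d_model, d_vlm, seed=0):
        rng = random.Random(seed)
        self.v_expert = Expert(d_model, d_vlm, rng)  # fuses VLM image features
        self.t_expert = Expert(d_model, d_vlm, rng)  # fuses VLM text features

    def __call__(self, hidden, image_feats=None, text_feats=None):
        fused = [list(h) for h in hidden]
        for expert, feats in (
            (self.v_expert, image_feats),
            (self.t_expert, text_feats),
        ):
            if feats is None:
                continue  # sub-module not activated for this task
            update = expert(hidden, feats)
            # Residual addition (assumption) keeps the PLM states as the base.
            fused = [[a + b for a, b in zip(row, upd)] for row, upd in zip(fused, update)]
        return fused

    def trainable_params(self):
        # Only the added adapter weights are counted as trainable.
        return self.v_expert.num_params() + self.t_expert.num_params()


adapter = XAdapterSketch(d_model=4, d_vlm=3)
hidden = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]   # stand-in PLM hidden states
image_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]        # stand-in VLM image features
fused = adapter(hidden, image_feats=image_feats)         # V-expert only
```

Passing only `image_feats` activates the V-expert alone, passing only `text_feats` activates the T-expert alone, and passing neither returns the hidden states unchanged, which mirrors the task-dependent activation described in the abstract.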