Humans learn language from multi-modal knowledge. However, because most existing pre-trained language models (PLMs) are pre-trained on text alone, they lack access to multi-modal information. To inject visual knowledge into PLMs, existing methods use either the text or the image encoder of a vision-language model (VLM) to encode visual information, and they update all of the PLM's original parameters for knowledge fusion. In this paper, we propose a new plug-and-play module, X-adapter, that flexibly leverages the aligned visual and textual knowledge learned by pre-trained VLMs and efficiently injects it into PLMs. Specifically, we insert X-adapters into PLMs and update only the added parameters during adaptation. To fully exploit the potential of VLMs, each X-adapter consists of two sub-modules, V-expert and T-expert, which fuse the VLM's image and text representations, respectively. Different sub-modules can be activated depending on the downstream task. Experimental results show that our method significantly improves performance on object-color reasoning and natural language understanding (NLU) tasks compared with PLM baselines.
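To make the two-expert design concrete, the following is a minimal sketch rather than the actual X-adapter implementation. The abstract does not specify how the experts fuse VLM representations, so single-head cross-attention with a residual connection is assumed here purely for illustration. All names (`Expert`, `XAdapterSketch`, `image_feats`, `text_feats`) and all dimensions are hypothetical, and no training loop is shown; in the described setup the PLM weights would stay frozen and only the adapter weights would be updated.

```python
import math
import random
from typing import List, Optional

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m: Matrix) -> Matrix:
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def random_matrix(rows: int, cols: int, rng: random.Random, scale: float = 0.02) -> Matrix:
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class Expert:
    """Hypothetical expert: PLM hidden states attend over VLM features (assumed mechanism)."""

    def __init__(self, d_plm: int, d_vlm: int, rng: random.Random) -> None:
        self.w_q = random_matrix(d_plm, d_plm, rng)  # queries from PLM states
        self.w_k = random_matrix(d_vlm, d_plm, rng)  # keys from VLM features
        self.w_v = random_matrix(d_vlm, d_plm, rng)  # values from VLM features

    def __call__(self, hidden: Matrix, feats: Matrix) -> Matrix:
        q = matmul(hidden, self.w_q)
        k = matmul(feats, self.w_k)
        v = matmul(feats, self.w_v)
        k_t = [list(col) for col in zip(*k)]
        scale = math.sqrt(len(q[0]))
        scores = [[s / scale for s in row] for row in matmul(q, k_t)]
        return matmul(softmax_rows(scores), v)


class XAdapterSketch:
    """Two optional experts; each active expert adds a residual update to the PLM states."""

    def __init__(self, d_plm: int, d_vlm: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.v_expert = Expert(d_plm, d_vlm, rng)  # fuses VLM image representations
        self.t_expert = Expert(d_plm, d_vlm, rng)  # fuses VLM text representations

    def __call__(self, hidden: Matrix,
                 image_feats: Optional[Matrix] = None,
                 text_feats: Optional[Matrix] = None) -> Matrix:
        out = [row[:] for row in hidden]
        for expert, feats in ((self.v_expert, image_feats), (self.t_expert, text_feats)):
            if feats is None:  # expert not activated for this task
                continue
            update = expert(hidden, feats)
            out = [[o + u for o, u in zip(o_row, u_row)]
                   for o_row, u_row in zip(out, update)]
        return out


rng = random.Random(1)
hidden = random_matrix(4, 8, rng, scale=1.0)       # 4 tokens, assumed PLM width 8
image_feats = random_matrix(3, 6, rng, scale=1.0)  # 3 image features, assumed VLM width 6
text_feats = random_matrix(5, 6, rng, scale=1.0)   # 5 text features, assumed VLM width 6

adapter = XAdapterSketch(d_plm=8, d_vlm=6)
fused_both = adapter(hidden, image_feats, text_feats)     # both experts active
fused_text_only = adapter(hidden, text_feats=text_feats)  # only T-expert active
passthrough = adapter(hidden)                             # no expert active
print(len(fused_both), len(fused_both[0]))
```

Passing `None` for a feature set leaves the corresponding expert inactive, which mirrors the idea of choosing sub-modules per downstream task. The output keeps the shape of the PLM hidden states, so a module of this kind could sit between existing PLM layers without changing them.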