Humans learn language through multi-modal knowledge. However, because they are pre-trained on text alone, most existing pre-trained language models (PLMs) lack access to multi-modal information. To inject visual knowledge into PLMs, existing methods use either the text or image encoder of a vision-language model (VLM) to encode the visual information, and update all of the PLM's original parameters for knowledge fusion. In this paper, we propose a new plug-and-play module, X-adapter, that flexibly leverages the aligned visual and textual knowledge learned by pre-trained VLMs and efficiently injects it into PLMs. Specifically, we insert X-adapters into PLMs and update only the added parameters during adaptation. To fully exploit the potential of VLMs, each X-adapter consists of two sub-modules, V-expert and T-expert, which fuse the VLM's image and text representations, respectively. Different sub-modules can be activated depending on the downstream task. Experimental results show that our method significantly improves performance on object-color reasoning and natural language understanding (NLU) tasks compared with PLM baselines.
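To make the module layout concrete, the following is a minimal sketch of how such an adapter could be organized. It is not the authors' implementation. The abstract does not state how the experts fuse VLM representations, so single-head cross-attention is assumed here purely for illustration. The names `Expert`, `XAdapter`, and `trainable_parameters`, the dimensions, and the random initialization are all hypothetical. The frozen PLM is represented only by the hidden states passed in, and the VLM only by precomputed feature vectors.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class Expert:
    """Hypothetical expert: PLM hidden states attend over VLM features.

    Cross-attention is an assumption of this sketch, not a detail
    taken from the abstract.
    """

    def __init__(self, d_plm, d_vlm, rng):
        self.w_q = rand_matrix(d_plm, d_plm, rng)
        self.w_k = rand_matrix(d_vlm, d_plm, rng)
        self.w_v = rand_matrix(d_vlm, d_plm, rng)

    def __call__(self, hidden, vlm_feats):
        q = matmul(hidden, self.w_q)      # (n_tokens x d_plm)
        k = matmul(vlm_feats, self.w_k)   # (n_feats x d_plm)
        v = matmul(vlm_feats, self.w_v)   # (n_feats x d_plm)
        k_t = [list(col) for col in zip(*k)]
        scale = math.sqrt(len(q[0]))
        attn = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
        return matmul(attn, v)            # (n_tokens x d_plm)


class XAdapter:
    """Hypothetical adapter holding a V-expert and a T-expert."""

    def __init__(self, d_plm, d_vlm, seed=0):
        rng = random.Random(seed)
        self.v_expert = Expert(d_plm, d_vlm, rng)
        self.t_expert = Expert(d_plm, d_vlm, rng)

    def __call__(self, hidden, image_feats=None, text_feats=None):
        # Residual fusion: an expert contributes only when its features are given.
        out = [row[:] for row in hidden]
        for expert, feats in ((self.v_expert, image_feats), (self.t_expert, text_feats)):
            if feats is None:
                continue  # this expert is not activated for the task
            update = expert(hidden, feats)
            out = [[o + u for o, u in zip(orow, urow)] for orow, urow in zip(out, update)]
        return out

    def trainable_parameters(self):
        # Only the adapter's own weights; the PLM's weights are never listed here.
        experts = (self.v_expert, self.t_expert)
        return [w for e in experts for w in (e.w_q, e.w_k, e.w_v)]


# Toy inputs: 2 tokens with d_plm = 4, VLM features with d_vlm = 3.
hidden = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
image_feats = [[1.0, 0.0, 0.5]]
text_feats = [[0.2, 0.4, 0.6], [0.1, 0.3, 0.5]]

adapter = XAdapter(d_plm=4, d_vlm=3)
fused_v = adapter(hidden, image_feats=image_feats)    # V-expert only
fused_both = adapter(hidden, image_feats, text_feats)  # both experts
```

Passing only `image_feats` or only `text_feats` mirrors the idea of activating a single sub-module for a given downstream task. Calling the adapter with neither leaves the PLM hidden states unchanged.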