Although pre-trained language models~(PLMs) have shown impressive performance through text-only self-supervised training, they have been found to lack visual semantics or commonsense. Existing solutions often rely on explicit images for visual knowledge augmentation, which requires time-consuming retrieval or generation, and they apply the augmentation to the whole input text without considering whether it is actually needed for a specific input or task. To address these issues, we propose \textbf{VAWI}, a novel \textbf{V}isually-\textbf{A}ugmented fine-tuning approach that can be generally applied to various PLMs and NLP tasks \textbf{W}ithout using any retrieved or generated \textbf{I}mages. Experimental results show that our approach consistently improves the performance of BERT, RoBERTa, BART, and T5 at different scales, and outperforms several competitive baselines on ten tasks. Our code and data are publicly available at~\url{https://github.com/RUCAIBox/VAWI}.