Since commonsense information is recorded far less often than it occurs in the world, language models pre-trained on text generation have difficulty learning sufficient commonsense knowledge. Several studies have leveraged text retrieval to augment models' commonsense ability. Unlike text, images capture commonsense information inherently, yet little effort has been made to utilize them effectively. In this work, we propose a novel Multi-mOdal REtrieval (MORE) augmentation framework that leverages both text and images to enhance the commonsense ability of language models. Extensive experiments on the CommonGen task demonstrate the efficacy of MORE on pre-trained models of both single and multiple modalities.