Large-scale commonsense knowledge bases empower a broad range of AI applications, for which automatic commonsense knowledge extraction (CKE) is a fundamental and challenging problem. CKE from text suffers from the inherent sparsity and reporting bias of commonsense in text. Visual perception, on the other hand, contains rich commonsense knowledge about real-world entities, e.g., (person, can_hold, bottle), and can serve as a promising source for acquiring grounded commonsense knowledge. In this work, we present CLEVER, which formulates CKE as a distantly supervised multi-instance learning problem, where models learn to summarize commonsense relations from a bag of images about an entity pair without any human annotation on image instances. To address the problem, CLEVER leverages vision-language pre-training models for deep understanding of each image in the bag, and selects informative instances from the bag to summarize commonsense entity relations via a novel contrastive attention mechanism. Comprehensive experimental results in held-out and human evaluation show that CLEVER can extract commonsense knowledge with promising quality, outperforming pre-trained language model-based methods by 3.9 AUC and 6.4 mAUC points. The predicted commonsense scores show a strong correlation with human judgment, with a Spearman coefficient of 0.78. Moreover, the extracted commonsense can also be grounded into images with reasonable interpretability. The data and code are available at https://github.com/thunlp/CLEVER.
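To make the multi-instance formulation concrete, the following is a minimal sketch of attention-based aggregation over a bag of images. It is not CLEVER's contrastive attention mechanism, which the abstract does not specify; it uses plain dot-product selective attention as a stand-in. The function names (`aggregate_bag`, `relation_scores`), the relation query vectors, and the two-dimensional toy features are all hypothetical, and the per-image feature vectors are assumed to come from some vision-language encoder that is not shown here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def aggregate_bag(
    instance_feats: Sequence[Sequence[float]],
    relation_query: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Summarize a bag of per-image features for one candidate relation.

    Assumption: plain dot-product attention, standing in for the paper's
    contrastive attention. Returns (bag_vector, attention_weights).
    """
    if not instance_feats:
        raise ValueError("bag must contain at least one image feature")
    # Images whose features align with the relation query get more weight.
    weights = softmax([dot(f, relation_query) for f in instance_feats])
    dim = len(instance_feats[0])
    bag = [
        sum(w * f[d] for w, f in zip(weights, instance_feats))
        for d in range(dim)
    ]
    return bag, weights


def relation_scores(
    instance_feats: Sequence[Sequence[float]],
    relation_queries: Dict[str, Sequence[float]],
) -> Dict[str, float]:
    """Score every candidate relation against its own bag summary."""
    names = list(relation_queries)
    logits = []
    for name in names:
        query = relation_queries[name]
        bag, _ = aggregate_bag(instance_feats, query)
        logits.append(dot(bag, query))
    return dict(zip(names, softmax(logits)))


if __name__ == "__main__":
    # Toy bag of three image features for the entity pair (person, bottle).
    feats = [[2.0, 0.0], [0.0, 1.0], [1.5, 0.2]]
    # Hypothetical relation query vectors; "no_relation" is an assumed label.
    queries = {"can_hold": [1.0, 0.0], "no_relation": [0.0, 1.0]}
    print(relation_scores(feats, queries))
```

Only the bag-level label, i.e. the relation assumed to hold for the entity pair, would supervise such a model. The attention weights decide which images contribute to the summary, which is what allows training without annotations on individual images.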