Language models (LMs) pretrained on large corpora of web text have been observed to encode a large amount of diverse knowledge about the world. This observation has led to a new and exciting paradigm in knowledge graph construction where, instead of manual curation or text mining, one extracts knowledge from the parameters of an LM. Recently, it has been shown that finetuning LMs on one set of facts makes them produce better answers to queries from a different set, making finetuned LMs good candidates for knowledge extraction and, consequently, knowledge graph construction. In this paper, we analyze finetuned LMs for factual knowledge extraction. We show that, along with its previously known positive effects, finetuning also leads to a (potentially harmful) phenomenon which we call Frequency Shock: at test time, the model over-predicts rare entities that appear in the training set and under-predicts common entities that do not appear often enough in the training set. We show that Frequency Shock degrades the model's predictions and that, beyond a certain point, its harm can even outweigh the positive effects of finetuning, making finetuning harmful overall. We then consider two remedies for this negative effect: (1) model mixing and (2) mixture finetuning with the LM's pre-training task. Combined, the two remedies lead to significant improvements over vanilla finetuning.
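To make the idea of model mixing concrete, the following is a minimal sketch under one assumed reading of the term: a convex combination of the pretrained and finetuned models' probability distributions over candidate entities. The abstract does not specify the mixing procedure, so the function `mix_predictions`, the `weight` parameter, and the toy probabilities are all hypothetical and serve only as illustration.

```python
def mix_predictions(p_pretrained, p_finetuned, weight=0.5):
    """Convex combination of two distributions over candidate entities.

    Assumption (not stated in the abstract): mixing is a weighted
    average of the two models' output probabilities.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    entities = set(p_pretrained) | set(p_finetuned)
    return {
        e: weight * p_finetuned.get(e, 0.0)
        + (1.0 - weight) * p_pretrained.get(e, 0.0)
        for e in entities
    }


# Toy numbers, invented purely for illustration: the finetuned model
# over-predicts a rare entity ("Vaduz") that it saw during finetuning.
p_pretrained = {"Paris": 0.6, "Lyon": 0.3, "Vaduz": 0.1}
p_finetuned = {"Paris": 0.3, "Lyon": 0.1, "Vaduz": 0.6}

mixed = mix_predictions(p_pretrained, p_finetuned, weight=0.5)
best = max(mixed, key=mixed.get)
```

In this toy case the finetuned model alone would rank the rare entity first, while the mixture pulls the prediction back toward the common entity favored by the pretrained model. How the mixing weight should be chosen is left open here.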