Pre-trained Language Models (PLMs) are trained on vast amounts of unlabeled data that is rich in world knowledge. This has sparked the community's interest in quantifying the factual knowledge present in PLMs, as it explains their performance on downstream tasks and potentially justifies their use as knowledge bases. In this work, we survey the methods and datasets used to probe PLMs for factual knowledge. Our contributions are: (1) we propose a categorization scheme for factual probing methods based on how their inputs, outputs, and the probed PLMs are adapted; (2) we provide an overview of the datasets used for factual probing; and (3) we synthesize insights about knowledge retention and prompt optimization in PLMs, analyze obstacles to adopting PLMs as knowledge bases, and outline directions for future work.