Pre-trained language models (PLMs) like BERT have made significant progress on various downstream NLP tasks. However, by asking models to complete cloze-style tests, recent work finds that PLMs fall short in acquiring knowledge from unstructured text. To understand the internal behaviour of PLMs when retrieving knowledge, we first define knowledge-bearing (K-B) tokens and knowledge-free (K-F) tokens for unstructured text and ask professional annotators to label a set of samples manually. We then find that PLMs are more likely to give wrong predictions on K-B tokens and pay less attention to those tokens inside the self-attention module. Based on these observations, we develop two solutions that help the model learn more knowledge from unstructured text in a fully self-supervised manner. Experiments on knowledge-intensive tasks show the effectiveness of the proposed methods. To the best of our knowledge, we are the first to explore fully self-supervised learning of knowledge in continual pre-training.
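The comparison between K-B and K-F tokens described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' analysis code: the `TokenRecord` type, the `summarize` helper, and every value in `toy_records` are hypothetical placeholders, and it assumes that per-token annotator labels, cloze-prediction outcomes, and averaged attention weights have already been collected.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class TokenRecord:
    """One annotated token (hypothetical schema, not from the paper)."""
    token: str
    is_knowledge_bearing: bool   # annotator label: True = K-B, False = K-F
    predicted_correctly: bool    # did the masked-token prediction match the original?
    attention_received: float    # attention weight the token received, averaged by the caller


def summarize(records: Iterable[TokenRecord], knowledge_bearing: bool) -> Dict[str, float]:
    """Return token count, prediction error rate, and mean attention for one group."""
    group = [r for r in records if r.is_knowledge_bearing == knowledge_bearing]
    if not group:
        raise ValueError("no tokens in the requested group")
    errors = sum(not r.predicted_correctly for r in group)
    return {
        "count": len(group),
        "error_rate": errors / len(group),
        "mean_attention": sum(r.attention_received for r in group) / len(group),
    }


# Made-up placeholder values, chosen only to exercise the function.
toy_records = [
    TokenRecord("Paris", True, False, 0.125),
    TokenRecord("France", True, True, 0.125),
    TokenRecord("is", False, True, 0.25),
    TokenRecord("the", False, True, 0.5),
]

kb_stats = summarize(toy_records, knowledge_bearing=True)
kf_stats = summarize(toy_records, knowledge_bearing=False)
```

Comparing `kb_stats` with `kf_stats` on real annotated data is the kind of check that would surface the two observations in the abstract: a higher error rate and a lower mean attention on K-B tokens.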