In recent years, Pre-trained Language Models (PLMs) have shown their superiority by pre-training on unstructured text corpora and then fine-tuning on downstream tasks. On entity-rich textual resources like Wikipedia, Knowledge-Enhanced PLMs (KEPLMs) incorporate the interactions between tokens and mentioned entities during pre-training, and are thus more effective on entity-centric tasks such as entity linking and relation classification. Although they exploit Wikipedia's rich structure to some extent, conventional KEPLMs still neglect a distinctive layout of the corpus: each Wikipedia page is built around a topic entity (identified by the page URL and shown in the page title). In this paper, we demonstrate that KEPLMs that do not incorporate topic entities suffer from insufficient entity interaction and biased (relation) word semantics. We thus propose KEPLET, a novel Knowledge-Enhanced Pre-trained LanguagE model with Topic entity awareness. In an end-to-end manner, KEPLET identifies where to add the topic entity's information in a Wikipedia sentence, fuses that information into the token and mentioned-entity representations, and supervises the network learning, thereby bringing topic entities back into consideration. Experiments demonstrate the generality and superiority of KEPLET: applied to two representative KEPLMs, it achieves significant improvements on four entity-centric tasks.
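To make the "identify where to add, then fuse" idea concrete, the following is a minimal illustrative sketch and not KEPLET's actual architecture. It assumes a hypothetical scalar gate per position, computed as the sigmoid of the dot product between a hidden state and the topic entity vector, and an additive fusion; the function name `fuse_topic_entity` and the toy vectors are invented for illustration. The supervision step mentioned in the abstract is not modeled here.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def fuse_topic_entity(hidden_states, topic_vec):
    """Hypothetical gated fusion (an assumption, not the paper's formulation).

    hidden_states: list of token or mentioned-entity vectors for one sentence.
    topic_vec: vector of the page's topic entity.
    Returns (gates, fused), where each gate in (0, 1) says how strongly the
    topic entity is injected at that position.
    """
    # "Where to add": positions more aligned with the topic get a larger gate.
    gates = [sigmoid(dot(h, topic_vec)) for h in hidden_states]
    # "Fuse": add the gate-scaled topic vector to each hidden state.
    fused = [
        [h_k + g * t_k for h_k, t_k in zip(h, topic_vec)]
        for h, g in zip(hidden_states, gates)
    ]
    return gates, fused


# Toy example: three positions with 2-dimensional hidden states.
hidden = [[1.0, 0.0], [0.0, 0.0], [-1.0, 0.0]]
topic = [2.0, 0.0]
gates, fused = fuse_topic_entity(hidden, topic)
```

In this toy setup, the position aligned with the topic vector receives the largest gate and the opposed position the smallest, so the topic entity's information is injected selectively rather than uniformly.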