Due to their capacity to acquire world knowledge from large corpora, pre-trained language models (PLMs) are extensively used in ultra-fine entity typing tasks, where the space of labels is extremely large. In this work, we explore the limitations of the knowledge acquired by PLMs by proposing a novel heuristic to approximate the pre-training distribution of entities when the pre-training data is unknown. We then systematically demonstrate that entity typing approaches that rely solely on the parametric knowledge of PLMs struggle significantly with entities at the long tail of the pre-training distribution, and that knowledge-infused approaches can mitigate some of these shortcomings. Our findings suggest that we need to go beyond PLMs to produce solutions that perform well for infrequent entities.