Links are a fundamental part of information networks, turning isolated pieces of knowledge into a web of information that is much richer than the sum of its parts. However, adding a new link to a network is not trivial: it requires not only identifying a suitable pair of source and target entities but also understanding the content of the source in order to locate a suitable position for the link in the text. The latter problem has not been addressed effectively, particularly in the absence of text spans in the source that could serve as anchors for a link to the target entity. To bridge this gap, we introduce and operationalize the task of entity insertion in information networks. Focusing on the case of Wikipedia, we empirically show that this problem is both relevant and challenging for editors. We compile a benchmark dataset in 105 languages and develop a framework for entity insertion called LocEI (Localized Entity Insertion) and its multilingual variant XLocEI. We show that XLocEI outperforms all baseline models (including state-of-the-art prompt-based ranking with LLMs such as GPT-4) and that it can be applied in a zero-shot manner to languages not seen during training with minimal performance drop. These findings are important for applying entity insertion models in practice, e.g., to support editors in adding links across the more than 300 language versions of Wikipedia.