Named Entity Disambiguation is the Natural Language Processing task of identifying textual records that correspond to the same Named Entity, i.e. a real-world entity represented as a list of attributes (names, places, organisations, etc.). In this work, we address the task of disambiguating companies on the basis of their written names. We propose a Siamese LSTM Network approach that learns, via supervised learning, an embedding of company name strings in a (relatively) low-dimensional vector space, and we use this representation to identify pairs of company names that refer to the same company (i.e. the same Entity). Because manually labelling string pairs is onerous, we analyse how an Active Learning approach that prioritises the samples to be labelled leads to a more efficient overall learning pipeline. Our empirical investigations show that the proposed Siamese Network outperforms several benchmark approaches based on standard string matching algorithms when enough labelled data are available. Moreover, we show that Active Learning prioritisation is indeed helpful when labelling resources are limited: it lets the learning models reach out-of-sample performance saturation with less labelled data than standard (random) data labelling approaches.
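The prioritisation idea in the abstract can be illustrated with a minimal sketch. It assumes an uncertainty-sampling query strategy, in which the pairs whose score lies closest to the decision threshold are sent for labelling first; the abstract does not say which strategy the authors use. The scorer is `difflib.SequenceMatcher` from the Python standard library, a stand-in of the string matching kind and not the proposed Siamese LSTM. The names `similarity`, `prioritise`, `candidate_pairs`, and the threshold value are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a score in [0, 1] for two company names.

    Stand-in scorer (assumption): a learned model would replace this.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def prioritise(pairs, threshold=0.5):
    """Sort candidate pairs so the most uncertain ones come first.

    Uncertainty is measured as the distance between the similarity
    score and the (hypothetical) decision threshold.
    """
    return sorted(pairs, key=lambda p: abs(similarity(*p) - threshold))


# Hypothetical unlabelled pool of company-name pairs.
candidate_pairs = [
    ("Acme Corp", "Acme Corp"),
    ("Acme Corp", "Zenith Holdings"),
    ("Acme Corp", "Acme Industries"),
]

# Pairs at the front of this list would be labelled first.
ranked = prioritise(candidate_pairs)
```

In a full loop, the top-ranked pairs would be labelled, the model retrained, and the remaining pool re-ranked. Random labelling corresponds to shuffling the pool instead of sorting it.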