Named Entity Disambiguation is the Natural Language Processing task of identifying textual records that correspond to the same Named Entity, i.e. a real-world entity represented as a list of attributes (names, places, organisations, etc.). In this work, we address the task of disambiguating companies on the basis of their written names. We propose a Siamese LSTM Network approach that extracts, via supervised learning, an embedding of company name strings in a (relatively) low-dimensional vector space, and we use this representation to identify pairs of company names that actually represent the same company (i.e. the same Entity). Given that manually labelling string pairs is a rather onerous task, we analyse how an Active Learning approach that prioritises the samples to be labelled leads to a more efficient overall learning pipeline. Through empirical investigations, we show that our proposed Siamese Network outperforms several benchmark approaches based on standard string matching algorithms when enough labelled data are available. Moreover, we show that Active Learning prioritisation is indeed helpful when labelling resources are limited, and lets the learning models reach out-of-sample performance saturation with less labelled data than standard (random) data labelling approaches.
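As an illustration of what prioritising the samples to be labelled can mean in practice, the sketch below uses uncertainty sampling, one common Active Learning strategy. The abstract does not state which prioritisation criterion is actually used, so this choice is an assumption, and the function name `select_for_labelling`, the `budget` parameter, and the probability values are hypothetical.

```python
def select_for_labelling(match_probs, budget):
    """Pick the unlabelled pairs the current model is least sure about.

    match_probs: predicted probability, for each unlabelled pair of company
                 names, that the two strings denote the same company.
    budget:      number of pairs that can be sent to a human annotator.

    Assumption: uncertainty is measured as closeness to 0.5; this is a
    generic criterion, not necessarily the one used in the paper.
    """
    ranked = sorted(
        range(len(match_probs)),
        key=lambda i: abs(match_probs[i] - 0.5),  # most uncertain first
    )
    return ranked[:budget]


# Hypothetical model outputs for five unlabelled company-name pairs.
probs = [0.97, 0.52, 0.08, 0.45, 0.70]
chosen = select_for_labelling(probs, budget=2)
```

In a full loop, the chosen pairs would be labelled manually, added to the training set, and the model retrained before the next selection round; random labelling corresponds to drawing the indices uniformly instead of ranking them.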