Entity standardization maps noisy mentions from free-form text to standard entities in a knowledge base. Compared with other entity-related tasks, it poses two distinctive challenges: mentions come with little or no surrounding context, and their surface forms vary widely. Both are compounded when generalizing across domains where labeled data is scarce. Previous research has mostly produced models that either rely heavily on context or are dedicated to a single domain. In contrast, we propose CoSiNES, a generic and adaptable framework built on a Contrastive Siamese Network for Entity Standardization, which adapts a pretrained language model to capture the syntax and semantics of the entities in a new domain. We construct a new dataset in the technology domain, containing 640 technical stack entities and 6,412 mentions collected from industrial content management systems. We demonstrate that CoSiNES yields higher accuracy and faster runtime than baselines derived from leading methods in this domain. CoSiNES also achieves competitive performance on four standard datasets from the chemistry, medicine, and biomedical domains, demonstrating its cross-domain applicability.
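To make the task setup concrete, the sketch below treats standardization as nearest-neighbor retrieval in an embedding space, paired with a triplet-style margin loss as one common form of contrastive objective. It is illustrative only and not the CoSiNES implementation: the character-trigram `embed` function is a toy stand-in for a fine-tuned pretrained language model, and the names `ENTITIES`, `standardize`, `triplet_loss`, and the margin value are all assumptions introduced here.

```python
import math
from collections import Counter

# Hypothetical knowledge base of standard entities (not from the paper's dataset).
ENTITIES = ["PostgreSQL", "MySQL", "Microsoft SQL Server"]


def embed(text, n=3):
    """Toy encoder: bag of character n-grams. A stand-in for a language model."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def standardize(mention, entities):
    """Return the standard entity whose embedding is closest to the mention."""
    m = embed(mention)
    return max(entities, key=lambda e: cosine(m, embed(e)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin loss on cosine distance: zero once the positive entity is closer
    to the anchor mention than the negative entity by at least `margin`."""
    a = embed(anchor)
    d_pos = 1.0 - cosine(a, embed(positive))
    d_neg = 1.0 - cosine(a, embed(negative))
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    print(standardize("postgres sql 9.6", ENTITIES))
    print(triplet_loss("postgres sql 9.6", "PostgreSQL", "MySQL"))
```

In a learned setting, both branches of the siamese network would share one trainable encoder in place of `embed`, and a loss of this kind would be minimized over mention/entity triplets. Inference would stay a similarity lookup like `standardize`, and it needs no surrounding context for the mention.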