State-of-the-art Named Entity Recognition (NER) models have achieved an impressive ability to extract common phrases from text that belong to labels such as location, organization, time, and person. However, typical NER systems that rely on having seen a specific entity in their training data in order to label it perform poorly on rare or unseen entities (Derczynski et al., 2017). This paper attempts to improve recognition of person names, a diverse category that can grow any time someone is born or changes their name. For downstream tasks not to exhibit bias based on cultural background, a model should perform well on names from a variety of backgrounds. In this paper I experiment with the training data and input structure of an English Bi-LSTM name recognition model. I look at names from 103 countries to compare how well the model performs on names from different cultures, specifically in the context of a downstream task where extracted names will be matched to information on file. I find that a model with combined character and word input outperforms word-only models and may improve accuracy over classical NER models that are not geared toward identifying unseen entity values.