Named Entity Recognition (NER) is a fundamental NLP task that locates key information in text and is primarily applied in conversational and search systems. In commercial applications, NER or comparable slot-filling methods have been widely deployed for popular languages. NER is used in domains such as human resources, customer service, search engines, content classification, and academia. In this paper, we focus on identifying named entities in closely related low-resource Indian languages, such as Hindi and Marathi. We use several adaptations of BERT, including base BERT, ALBERT, and RoBERTa, to train supervised NER models. We also compare multilingual models with monolingual models and establish a baseline. We show that Hindi and Marathi can assist each other on the NER task: models trained on multiple languages perform better than models trained on a single language. However, we also observe that blindly mixing all datasets does not necessarily yield improvements, and data selection methods may be required.