BanglaCoNER: Towards Robust Bangla Complex Named Entity Recognition

Named Entity Recognition (NER) is a fundamental task in natural language processing that involves identifying and classifying named entities in text. However, little work has been done on complex named entity recognition (CNER) in Bangla, even though Bangla is the seventh most spoken language globally. CNER is more challenging than traditional NER because it involves identifying and classifying complex and compound entities, which are not common in the Bangla language. In this paper, we present the winning solution of the Bangla Complex Named Entity Recognition Challenge, which addresses the CNER task on the BanglaCoNER dataset using two different approaches: Conditional Random Fields (CRF) and fine-tuning transformer-based deep learning models such as BanglaBERT. The dataset consists of 15,300 sentences for training and 800 sentences for validation, in the .conll format. Exploratory Data Analysis (EDA) revealed that the dataset has 7 different NER tags and a notable presence of English words, suggesting that it is synthetic and likely a product of translation. For the CRF approach, we experimented with a variety of feature combinations, including Part of Speech (POS) tags, word suffixes, gazetteers, and cluster information from embeddings; we also fine-tuned the BanglaBERT (large) model for NER. We found that not all linguistic patterns are immediately apparent or even intuitive to humans, which is why deep learning models have proved more effective in NLP, including on the CNER task. Our fine-tuned BanglaBERT (large) model achieves an F1 score of 0.79 on the validation set. Overall, our study highlights the importance of Bangla Complex Named Entity Recognition, particularly in the context of synthetic datasets. Our findings also demonstrate the efficacy of deep learning models such as BanglaBERT for NER in the Bangla language.
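To make the CRF feature set more concrete, the following is a minimal sketch of how .conll data might be read and turned into per-token feature dictionaries. It is an illustration under stated assumptions, not the authors' implementation: the function names `parse_conll` and `token_features`, the column layout (token first, tag last), the feature names, and the `gazetteer` and `clusters` inputs are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def parse_conll(text: str) -> List[List[Tuple[str, str]]]:
    """Split CoNLL-style text into sentences of (token, tag) pairs.

    Assumptions (not confirmed by the abstract): one token per line,
    token in the first column, tag in the last column, blank lines
    between sentences, and lines starting with '#' are comments.
    """
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue  # skip comment / sentence-id lines
        if not line:
            if current:  # a blank line closes the current sentence
                sentences.append(current)
                current = []
            continue
        cols = line.split()
        current.append((cols[0], cols[-1]))
    if current:
        sentences.append(current)
    return sentences


def token_features(
    tokens: List[str],
    pos_tags: List[str],
    i: int,
    gazetteer: Set[str],
    clusters: Dict[str, int],
) -> Dict[str, object]:
    """Build a feature dict for token i from the feature families
    named in the abstract: POS tags, suffixes, gazetteers, clusters."""
    word = tokens[i]
    feats: Dict[str, object] = {
        "bias": 1.0,
        "word": word,
        # Suffixes are sliced by code point, which may split a Bangla
        # grapheme cluster; acceptable for a rough sketch.
        "suffix2": word[-2:],
        "suffix3": word[-3:],
        "pos": pos_tags[i],
        "in_gazetteer": word in gazetteer,
        # Flags Latin-script tokens such as untranslated English words.
        "is_ascii": word.isascii(),
        # Cluster id as a string so it is treated as categorical;
        # "-1" marks a word with no known cluster.
        "cluster": str(clusters.get(word, -1)),
    }
    if i > 0:
        feats["prev_word"] = tokens[i - 1]
        feats["prev_pos"] = pos_tags[i - 1]
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        feats["next_word"] = tokens[i + 1]
        feats["next_pos"] = pos_tags[i + 1]
    else:
        feats["EOS"] = True  # end of sentence
    return feats
```

A list of such dictionaries per sentence, paired with the tag sequence, is the kind of input that common CRF toolkits accept. The `is_ascii` feature is one simple way to exploit the EDA observation that the dataset contains many English words.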
