Towards Robust Bangla Complex Named Entity Recognition

Named Entity Recognition (NER) is a fundamental task in natural language processing that involves identifying and classifying named entities in text. However, little work has been done on complex named entity recognition (CNER) in Bangla, even though Bangla is the seventh most spoken language globally. CNER is more challenging than traditional NER because it involves identifying and classifying complex and compound entities, which are not common in the Bangla language. In this paper, we present the winning solution of the Bangla Complex Named Entity Recognition Challenge, addressing the CNER task on the BanglaCoNER dataset with two different approaches: Conditional Random Fields (CRF) and fine-tuning transformer-based deep learning models such as BanglaBERT. The dataset consists of 15,300 sentences for training and 800 sentences for validation, in the .conll format. Exploratory Data Analysis (EDA) revealed that the dataset has 7 different NER tags and a notable presence of English words, suggesting that it is synthetic and likely a product of translation. For the CRF approach, we experimented with a variety of feature combinations, including Part of Speech (POS) tags, word suffixes, gazetteers, and cluster information from embeddings; we also fine-tuned the BanglaBERT (large) model for NER. We found that not all linguistic patterns are immediately apparent or even intuitive to humans, which helps explain why deep-learning-based models have proved more effective in NLP, including for the CNER task. Our fine-tuned BanglaBERT (large) model achieves an F1 score of 0.79 on the validation set. Overall, our study highlights the importance of Bangla Complex Named Entity Recognition, particularly in the context of synthetic datasets. Our findings also demonstrate the efficacy of deep learning models such as BanglaBERT for NER in the Bangla language.
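To make the CRF feature families mentioned above more concrete, the following is a minimal sketch of how per-token features of that kind might be assembled from CoNLL-style input. It is not the authors' implementation. The names `read_conll` and `token_features`, the column layout (token in the first column, tag in the last, comment lines beginning with `# id`), the suffix lengths, and the `is_ascii` flag used as a rough signal for English words are all assumptions made for illustration. The POS tags, gazetteer, and cluster mapping are taken as precomputed inputs.

```python
from typing import Dict, List, Optional, Set, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs


def read_conll(text: str) -> List[Sentence]:
    """Parse CoNLL-style text: one token per line, blank line between sentences.

    Assumption: the token is the first whitespace-separated column, the tag
    is the last, and lines starting with '# id' are sentence-level comments.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("# id"):
            continue
        cols = line.split()
        current.append((cols[0], cols[-1]))
    if current:
        sentences.append(current)
    return sentences


def token_features(
    tokens: List[str],
    i: int,
    pos_tags: Optional[List[str]] = None,
    gazetteer: Optional[Set[str]] = None,
    clusters: Optional[Dict[str, str]] = None,
) -> Dict[str, object]:
    """Build a feature dict for tokens[i] (hypothetical feature set)."""
    word = tokens[i]
    feats: Dict[str, object] = {
        "bias": 1.0,
        "word": word,
        "suffix2": word[-2:],          # word-suffix features
        "suffix3": word[-3:],
        "is_ascii": word.isascii(),    # rough proxy for an English token
        "is_digit": word.isdigit(),
        "BOS": i == 0,                 # beginning of sentence
        "EOS": i == len(tokens) - 1,   # end of sentence
    }
    if pos_tags is not None:
        feats["pos"] = pos_tags[i]
    if gazetteer is not None:
        feats["in_gazetteer"] = word in gazetteer
    if clusters is not None:
        # Cluster ID derived from embeddings; unseen words map to "UNK".
        feats["cluster"] = clusters.get(word, "UNK")
    if i > 0:
        feats["prev_word"] = tokens[i - 1]
    if i < len(tokens) - 1:
        feats["next_word"] = tokens[i + 1]
    return feats
```

A sentence would then be represented as one such dict per token, a shape that CRF toolkits such as sklearn-crfsuite accept as input. Which of these features actually help on BanglaCoNER is an empirical question that this sketch does not address.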
