Lemmatization is important in both natural language processing (NLP) and linguistics because it reduces data density and aids in understanding contextual meaning. However, because Bangla is highly inflected and morphologically rich, lemmatizing Bangla text is a complex challenge. In this study, we propose linguistic rules for lemmatization and combine them with a dictionary to design a lemmatizer specifically for Bangla. Our system lemmatizes each word according to its part-of-speech class within a given sentence. Unlike previous rule-based approaches, we analyze how suffix markers occur according to their morpho-syntactic values and then use sequences of suffix markers instead of entire suffixes. To develop our rules, we analyze a large corpus of Bangla text from various domains, sources, and time periods to observe how inflected words are formed. The lemmatizer achieves an accuracy of 96.36% on a test dataset manually annotated by trained linguists and demonstrates competitive performance on three previously published Bangla lemmatization datasets. We make the code and datasets publicly available at https://github.com/eblict-gigatech/BanLemma to contribute to the further advancement of Bangla NLP.
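To make the idea of stripping a sequence of suffix markers (rather than matching one entire suffix) concrete, the following is a minimal sketch and not the BanLemma implementation. The marker inventory, the toy dictionary, the function name `lemmatize`, and the way the dictionary check is interleaved with stripping are all assumptions made for illustration; the actual rules are those in the released code.

```python
import unicodedata


def nfc(text: str) -> str:
    """Normalize to NFC so composed and decomposed vowel signs compare equal."""
    return unicodedata.normalize("NFC", text)


# Hypothetical marker inventory keyed by part-of-speech tag.
# These few noun markers are illustrative only, not the paper's rule set.
MARKERS = {
    "NOUN": [nfc(m) for m in ["গুলো", "দের", "কে", "রা", "র"]],
}

# Toy dictionary of known lemmas (an assumption for this sketch).
LEMMAS = {nfc(w) for w in ["ছেলে", "বই"]}


def lemmatize(word: str, pos: str) -> str:
    """Strip suffix markers one at a time until a dictionary lemma is reached.

    If no dictionary lemma is found, the word is returned unchanged.
    """
    word = nfc(word)
    # Try longer markers first so that a short marker does not cut into a longer one.
    markers = sorted(MARKERS.get(pos, []), key=len, reverse=True)
    stem = word
    while True:
        if stem in LEMMAS:
            return stem
        for marker in markers:
            # Require a non-empty remainder after removing the marker.
            if stem.endswith(marker) and len(stem) > len(marker):
                stem = stem[: -len(marker)]
                break
        else:
            # No marker matched and no lemma was found: leave the word as is.
            return word


if __name__ == "__main__":
    # "ছেলেদেরকে" is treated here as ছেলে + দের + কে (two markers in sequence).
    print(lemmatize("ছেলেদেরকে", "NOUN"))
    print(lemmatize("বইগুলো", "NOUN"))
```

In this sketch the part-of-speech tag selects which markers may be removed, so a word tagged with a class that has no inventory is returned untouched. A fuller system would need one inventory per class and rules for stem changes that plain suffix removal cannot handle.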