Bangla is the sixth most widely spoken language globally, with approximately 234 million native speakers. However, progress in open-source Bangla machine translation remains limited. Most online resources are in English and are often never translated into Bangla, which excludes millions of people from essential information. Existing research on Bangla translation focuses primarily on formal language and neglects the more commonly used informal language. This gap stems largely from the lack of paired Bangla-English data and of advanced translation models. If datasets and models can be enhanced to better handle natural, informal Bangla, millions of people will benefit from improved access to online information. In this research, we explore current state-of-the-art models and propose improvements to Bangla translation by developing a dataset from informal sources such as social media and conversational texts. This work aims to advance Bangla machine translation by focusing on informal language translation and improving accessibility for Bangla speakers in the digital world.