Accurately transcribing Bengali text into the International Phonetic Alphabet (IPA) is challenging because of the language's complex phonology and context-dependent sound changes. The challenge is even greater for regional Bengali dialects, which lack standardized spelling conventions, contain local and foreign words popular in their regions, and vary phonologically from region to region. This paper approaches this sequence-to-sequence problem by introducing the District Guided Tokens (DGT) technique on a new dataset spanning six districts of Bangladesh. The key idea is to give the model explicit information about the regional dialect, or "district", of the input text before it generates the IPA transcription. This is achieved by prepending a district token to the input sequence, which guides the model toward the phonetic patterns associated with each district. The DGT technique is applied to fine-tune several transformer-based models on this new dataset. Experimental results demonstrate the effectiveness of DGT, with the ByT5 model outperforming word-based models such as mT5, BanglaT5, and umT5. This is attributed to ByT5's ability to handle the high percentage of out-of-vocabulary words in the test set. The proposed approach highlights the importance of incorporating regional dialect information into ubiquitous natural language processing systems for languages with diverse phonological variations. This work resulted from the "Bhashamul" challenge, which is dedicated to transcribing Bengali text in regional dialects into IPA (https://www.kaggle.com/competitions/regipa/). The training and inference notebooks are available through the competition link.
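As a rough illustration of the idea of prepending a district token, the following Python sketch tags an input sentence with its district and then encodes it at the byte level, in the style of ByT5. The district labels, the `<...>` tag format, and the helper names `add_district_token` and `to_byte_ids` are assumptions made for this example and are not taken from the paper or the competition notebooks.

```python
# Hypothetical district labels; the real dataset defines its own six districts.
DISTRICTS = ("district_a", "district_b", "district_c")


def add_district_token(text: str, district: str) -> str:
    """Prepend a district tag so the model sees the dialect before the text."""
    if district not in DISTRICTS:
        raise ValueError(f"unknown district: {district!r}")
    # The "<district>" tag format is an assumption, not the paper's exact format.
    return f"<{district}> {text}"


def to_byte_ids(text: str, offset: int = 3) -> list[int]:
    """Encode text as one id per UTF-8 byte, in the style of ByT5.

    The offset reserves the lowest ids for special tokens (assumed here to be
    three, mirroring ByT5's pad/eos/unk ids).
    """
    return [byte + offset for byte in text.encode("utf-8")]


tagged = add_district_token("আমি", "district_a")
ids = to_byte_ids(tagged)
```

Because every string maps to a sequence of bytes, a byte-level encoding of this kind has no out-of-vocabulary inputs, which is consistent with the explanation given for ByT5's advantage on dialectal spellings.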