Transcribing Bengali Text with Regional Dialects to IPA using District Guided Tokens

Accurate transcription of Bengali text to the International Phonetic Alphabet (IPA) is challenging because of the language's complex phonology and its context-dependent sound changes. The challenge is even greater for regional Bengali dialects, which lack standardized spelling conventions, contain local and foreign words popular in their regions, and vary phonologically from one region to another. This paper presents an approach to this sequence-to-sequence problem by introducing the District Guided Tokens (DGT) technique on a new dataset spanning six districts of Bangladesh. The key idea is to give the model explicit information about the regional dialect, or "district", of the input text before it generates the IPA transcription. This is achieved by prepending a district token to the input sequence, which guides the model toward the phonetic patterns associated with each district. The DGT technique is applied to fine-tune several transformer-based models on this new dataset. Experimental results demonstrate the effectiveness of DGT, with the ByT5 model outperforming word-based models such as mT5, BanglaT5, and umT5. This is attributed to ByT5's ability to handle the high percentage of out-of-vocabulary words in the test set. The proposed approach highlights the importance of incorporating regional dialect information into ubiquitous natural language processing systems for languages with diverse phonological variations. This work resulted from the "Bhashamul" challenge, which is dedicated to the problem of transcribing Bengali text with regional dialects to IPA: https://www.kaggle.com/competitions/regipa/. The training and inference notebooks are available through the competition link.
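The prepending step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the `<district>` token format, and the district name used in the usage note are all assumptions made for the example. The second helper mimics the byte-level view that ByT5-style models take of their input (UTF-8 byte values shifted by 3 to leave room for special tokens), which suggests why unseen dialect words cannot fall outside the vocabulary.

```python
from typing import List


def add_district_token(text: str, district: str) -> str:
    """Prepend a district marker so the model reads the dialect before the text.

    The "<name>" format is a hypothetical choice for this sketch.
    """
    token = f"<{district.strip().lower()}>"
    return f"{token} {text.strip()}"


def utf8_byte_ids(text: str, offset: int = 3) -> List[int]:
    """Map text to byte-level ids in the style of ByT5 (byte value + offset).

    Every string has a UTF-8 encoding, so no input is out-of-vocabulary.
    """
    return [b + offset for b in text.encode("utf-8")]


if __name__ == "__main__":
    # "sylhet" is a placeholder district name, not taken from the dataset.
    source = add_district_token("আমি ভাত খাই", "Sylhet")
    print(source)
    print(len(utf8_byte_ids(source)))
```

In a fine-tuning setup, the string returned by `add_district_token` would be used as the model input and the IPA transcription as the target; the district token itself is then encoded like any other text.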

