Punctuation restoration enhances the readability of text and is critical for post-processing in Automatic Speech Recognition (ASR), especially for low-resource languages like Bangla. In this study, we explore the application of transformer-based models, specifically XLM-RoBERTa-large, to automatically restore punctuation in unpunctuated Bangla text. We focus on predicting four punctuation marks (period, comma, question mark, and exclamation mark) across diverse text domains. To address the scarcity of annotated resources, we constructed a large, varied training corpus and applied data augmentation techniques. Our best-performing model, trained with an augmentation factor of alpha = 0.20%, achieves an accuracy of 97.1% on the News test set, 91.2% on the Reference set, and 90.2% on the ASR set. These results show strong generalization to reference and ASR transcripts, demonstrating the model's effectiveness in real-world, noisy scenarios. This work establishes a strong baseline for Bangla punctuation restoration and contributes publicly available datasets and code to support future research in low-resource NLP.
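The task described above is commonly framed as token classification: the model assigns each word a label naming the punctuation mark (if any) that should follow it, and a decoding step reattaches the marks. The following is a minimal sketch of that decoding step only; the label names and the `restore` helper are illustrative assumptions, not the paper's actual code. Note that in Bangla the sentence-final "period" is the dari character (।).

```python
# Hypothetical label set: "O" means no punctuation follows the token.
# The dari (।) is Bangla's sentence-terminating mark, standing in for
# the period; comma, question mark, and exclamation mark are as usual.
PUNCT = {
    "O": "",
    "PERIOD": "\u0964",  # ।  (Bangla dari)
    "COMMA": ",",
    "QUESTION": "?",
    "EXCLAIM": "!",
}

def restore(tokens, labels):
    """Rebuild punctuated text by appending each token's predicted mark."""
    return " ".join(tok + PUNCT[lab] for tok, lab in zip(tokens, labels))

# Example: "আপনি কেমন আছেন" ("how are you") with a question label on the
# final token yields the punctuated sentence.
print(restore(["আপনি", "কেমন", "আছেন"], ["O", "O", "QUESTION"]))
```

In a full pipeline, the `labels` sequence would come from a fine-tuned token-classification head (e.g. on XLM-RoBERTa-large, as the abstract states), with subword predictions collapsed back to word level before this decoding step.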