Bangla, one of the world's most widely spoken languages, remains underrepresented in state-of-the-art automatic speech recognition (ASR) research, particularly under noisy and speaker-diverse conditions. This paper presents BanglaRobustNet, a hybrid denoising-attention framework built on Wav2Vec-BERT and designed to address these challenges. The architecture integrates two components: a diffusion-based denoising module that suppresses environmental noise while preserving Bangla-specific phonetic cues, and a contextual cross-attention module that conditions recognition on speaker embeddings for robustness across gender, age, and dialect. Trained end-to-end with a composite objective combining CTC loss, a phonetic consistency term, and a speaker alignment term, BanglaRobustNet achieves substantial reductions in word error rate (WER) and character error rate (CER) relative to Wav2Vec-BERT and Whisper baselines. Evaluations on Mozilla Common Voice Bangla and on noise-augmented speech confirm the effectiveness of our approach, establishing BanglaRobustNet as a robust ASR system tailored to low-resource, noise-prone linguistic settings.
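For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how WER and CER are conventionally computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, over words for WER and over characters for CER. It is not the evaluation code used for BanglaRobustNet. The function names are hypothetical, and two choices here are assumptions: whitespace is ignored when computing CER, and characters are Unicode code points rather than grapheme clusters, which matters for Bangla conjuncts and vowel signs.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference tokens
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over code points, ignoring whitespace (assumption)."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # Illustrative strings only; not taken from any evaluation set.
    reference = "আমি ভাত খাই"
    hypothesis = "আমি ভাত খায়"
    print(f"WER = {wer(reference, hypothesis):.3f}")
    print(f"CER = {cer(reference, hypothesis):.3f}")
```

Both metrics can exceed 1.0 when the hypothesis contains many insertions. Text normalization applied before scoring (punctuation stripping, Unicode normalization) can noticeably change the resulting numbers, so the same normalization should be applied to every system being compared.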