Despite being one of the most widely spoken languages globally, Bangla remains a low-resource language in Natural Language Processing (NLP). Mainstream Automatic Speech Recognition (ASR) and Speaker Diarization systems for Bangla struggle when processing long-form audio exceeding 30 to 60 seconds. This paper presents a robust framework for extended Bangla content, developed for the DL Sprint 4.0 contest, that builds on existing models enhanced with novel optimization pipelines. Our approach uses Voice Activity Detection (VAD) optimization and Connectionist Temporal Classification (CTC) segmentation via forced word alignment to maintain temporal accuracy and transcription integrity over long durations. We also apply several fine-tuning techniques and preprocess the data with augmentation and noise removal. By narrowing the performance gap in complex, multi-speaker environments, this work provides a scalable solution for real-world, long-form Bangla speech applications.
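The abstract does not specify how VAD output is turned into model-sized inputs. As an illustration only, the sketch below shows one common way to do it: greedily packing consecutive speech regions into chunks no longer than a fixed limit, so that each chunk stays within the 30 to 60 second range where ASR models are reliable. The function name `pack_speech_segments`, the `(start, end)` segment format, and the 30-second default are assumptions made for this example, not details taken from the paper.

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec) of a speech region


def pack_speech_segments(
    segments: List[Segment], max_len: float = 30.0
) -> List[Segment]:
    """Greedily merge VAD speech regions into chunks of at most max_len seconds.

    Hypothetical helper for illustration; not the authors' pipeline.
    A single region longer than max_len is split at fixed max_len
    boundaries as a simple fallback.
    """
    chunks: List[Segment] = []
    cur_start: Optional[float] = None
    cur_end: Optional[float] = None

    for start, end in sorted(segments):
        # Break an over-long region into max_len pieces first.
        pieces: List[Segment] = []
        while end - start > max_len:
            pieces.append((start, start + max_len))
            start += max_len
        pieces.append((start, end))

        for p_start, p_end in pieces:
            if cur_start is None:
                cur_start, cur_end = p_start, p_end
            elif p_end - cur_start <= max_len:
                # Still fits: extend the current chunk across the silence gap.
                cur_end = p_end
            else:
                # Would exceed the limit: close the chunk and start a new one.
                chunks.append((cur_start, cur_end))
                cur_start, cur_end = p_start, p_end

    if cur_start is not None:
        chunks.append((cur_start, cur_end))
    return chunks


if __name__ == "__main__":
    vad_regions = [(0.0, 10.0), (12.0, 25.0), (27.0, 40.0), (45.0, 50.0)]
    print(pack_speech_segments(vad_regions))
    # [(0.0, 25.0), (27.0, 50.0)]
```

Because chunk boundaries always fall on silence detected by the VAD (except in the over-long fallback), words are less likely to be cut in half, and the original start times are preserved so that transcripts and speaker labels can be mapped back to the full recording.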