Although Automatic Speech Recognition (ASR) for Bengali has progressed significantly, processing long-duration audio and performing robust speaker diarization remain open research gaps. To address the severe scarcity of joint ASR and diarization resources for this language, we introduce Lipi-Ghor-882, an 882-hour multi-speaker Bengali dataset. In this paper, which details our submission to the DL Sprint 4.0 competition, we systematically evaluate architectures and approaches for long-form Bengali speech. For ASR, we find that raw data scaling is ineffective; instead, targeted fine-tuning on perfectly aligned annotations, paired with synthetic acoustic degradation (noise and reverberation), emerges as the single most effective approach. For speaker diarization, by contrast, state-of-the-art open-source models (such as Diarizen) performed surprisingly poorly on this complex dataset. Extensive model retraining yielded negligible improvements; instead, heuristic post-processing of baseline model outputs was the primary driver of accuracy gains. Ultimately, this work outlines an optimized dual pipeline achieving a Real-Time Factor (RTF) of $\sim$0.019, establishing a practical, empirically backed benchmark for low-resource, long-form speech processing.
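To put the reported RTF in concrete terms, the sketch below uses the standard definition of Real-Time Factor (processing time divided by audio duration). The function names are hypothetical and chosen only for illustration, and the one-hour recording is an assumed example input, not a measurement from this work.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; values below 1 are faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def processing_time_for(audio_seconds: float, rtf: float) -> float:
    """Estimated wall-clock processing time for a recording at a given RTF."""
    return audio_seconds * rtf


# Assumed example: a one-hour recording processed at the reported RTF of ~0.019.
one_hour = 3600.0
estimated = processing_time_for(one_hour, 0.019)
print(f"Estimated processing time: {estimated:.1f} s")
```

Under this definition, an RTF of about 0.019 corresponds to roughly 68 seconds of compute per hour of audio.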