Despite decades of research on reverberant speech, comparing methods remains difficult because most corpora lack per-file acoustic annotations or provide limited documentation for reproduction. We present RIR-Mega-Speech, a corpus of approximately 117.5 hours created by convolving LibriSpeech utterances with roughly 5,000 simulated room impulse responses from the RIR-Mega collection. Every file includes RT60, direct-to-reverberant ratio (DRR), and clarity index ($C_{50}$) computed from the source RIR using clearly defined, reproducible procedures. We also provide scripts to rebuild the dataset and reproduce all evaluation results. Using Whisper small on 1,500 paired utterances, we measure 5.20% WER (95% CI: 4.69--5.78) on clean speech and 7.70% (7.04--8.35) on reverberant versions, corresponding to a paired increase of 2.50 percentage points (2.06--2.98). This represents a 48% relative degradation. WER increases monotonically with RT60 and decreases with DRR, consistent with prior perceptual studies. While the core finding that reverberation harms recognition is well established, we aim to provide the community with a standardized resource where acoustic conditions are transparent and results can be verified independently. The repository includes one-command rebuild instructions for both Windows and Linux environments.
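The abstract states that every file carries RT60, DRR, and $C_{50}$ computed from the source RIR. As an illustration of what such per-RIR metric computation can look like, here is a minimal sketch for DRR and $C_{50}$; the function name, the 2.5 ms direct-path window, and the exact windowing conventions are assumptions for this sketch, not the corpus's documented procedure.

```python
import numpy as np

def rir_metrics(rir, sr, direct_window_ms=2.5):
    """Estimate DRR and C50 (in dB) from a room impulse response.

    A minimal sketch under common textbook definitions; the
    direct_window_ms value and peak-based alignment are assumptions,
    not the corpus's exact, documented procedure.
    """
    rir = np.asarray(rir, dtype=float)
    # Align on the direct-path peak of the RIR.
    peak = int(np.argmax(np.abs(rir)))
    # Direct sound: energy within +/- direct_window_ms of the peak.
    half = int(sr * direct_window_ms / 1000)
    lo, hi = max(0, peak - half), peak + half + 1
    e_direct = np.sum(rir[lo:hi] ** 2)
    e_reverb = np.sum(rir[hi:] ** 2)
    drr_db = 10 * np.log10(e_direct / e_reverb)
    # C50: energy in the first 50 ms after the peak vs. the remaining tail.
    k50 = peak + int(0.050 * sr)
    e_early = np.sum(rir[peak:k50] ** 2)
    e_late = np.sum(rir[k50:] ** 2)
    c50_db = 10 * np.log10(e_early / e_late)
    return drr_db, c50_db
```

For a synthetic RIR with a strong direct impulse and a weak exponential tail, both metrics come out positive, as expected for a source close to the receiver.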