Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": under severe, compositional distortions, models often lose acoustic grounding and produce omissions or hallucinations. We propose Mega-ASR, a unified framework for ASR in the wild that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. We introduce Voices-in-the-Wild-2M, a dataset covering 7 classic acoustic phenomena and 54 physically plausible compound scenarios, and train Mega-ASR with Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization. Extensive experiments show that Mega-ASR achieves significantly lower error rates than prior state-of-the-art systems on adverse-condition ASR benchmarks (45.69% vs. 54.01% on VOiCES R4-B-F, and 21.49% vs. 29.34% on NOIZEUS Sta-0). On complex compositional acoustic scenarios, Mega-ASR further delivers over 30% relative WER reduction against strong open- and closed-source baselines, establishing a scalable paradigm for robust ASR in the wild.
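For readers less familiar with the metric, the following is a minimal sketch of how word error rate (WER) and a relative WER reduction are conventionally computed. It is not the evaluation code behind the reported results: the function names `word_error_rate` and `relative_reduction` are hypothetical, the text normalization used in the experiments is not specified here, and the sketch assumes that the benchmark percentages quoted above are WERs.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, system: float) -> float:
    """Fraction of the baseline error removed by the system."""
    if baseline <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline - system) / baseline


if __name__ == "__main__":
    # Assumption: the figures quoted above are WERs in percent.
    for name, base, ours in [("VOiCES R4-B-F", 54.01, 45.69),
                             ("NOIZEUS Sta-0", 29.34, 21.49)]:
        print(f"{name}: {relative_reduction(base, ours):.1%} relative reduction")
```

Under that assumption, the two benchmark comparisons correspond to relative reductions of roughly 15.4% and 26.8%, which puts the separately reported reduction of over 30% on compositional scenarios in context.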