Modern Automatic Speech Recognition (ASR) systems have made remarkable progress on standard benchmarks, yet performance gaps have emerged under real-world distribution shifts, caused by recording conditions, accents, speech impairments, and noise. Existing datasets and benchmarks typically isolate these factors, which overlooks their co-occurrence in real-world applications. In this paper, we argue that model robustness can be treated as a dynamic capability that continually develops, and we introduce MoDiCoL, a Modular Diagnostic Continual Learning dataset designed for controlled analysis of linguistic content, speaker characteristics, and acoustic environments. Furthermore, we propose a real-world-inspired continual learning curriculum to simulate incremental updates and study how robustness is acquired, transferred, and forgotten. We evaluate three continual learning strategies and provide detailed insights into robustness under evolving conditions.