Developing a practically robust automatic speech recognition (ASR) model is challenging: the model must not only maintain its original performance on clean samples, but also perform consistently under small volume perturbations and large domain shifts. To address this problem, we propose a novel WavAugment Guided Phoneme Adversarial Training (wapat). wapat uses adversarial examples in phoneme space as augmentation, making the model invariant to minor fluctuations in phoneme representation while preserving performance on clean samples. In addition, wapat uses the phoneme representation of augmented samples to guide the generation of adversaries, which helps find more stable and diverse gradient directions and thereby improves generalization. Extensive experiments demonstrate the effectiveness of wapat on the End-to-end Speech Challenge Benchmark (ESB). Notably, SpeechLM-wapat achieves a 6.28% WER reduction over the original model on ESB, setting a new state of the art.