Neural front-ends are an appealing alternative to traditional, fixed feature extraction pipelines for automatic speech recognition (ASR) systems, since they can be trained directly to fit the acoustic model. However, their performance often falls short of classical methods, which we show is largely due to their increased susceptibility to overfitting. This work therefore investigates regularization methods for training ASR models with learnable feature extraction front-ends. First, we examine audio perturbation methods and show that larger relative improvements can be obtained for learnable features. Additionally, we identify two limitations in the standard use of SpecAugment for these front-ends and propose masking in the short-time Fourier transform (STFT) domain as a simple but effective modification to address these challenges. Finally, integrating both regularization approaches effectively closes the performance gap between traditional and learnable features.
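The abstract does not specify which audio perturbation methods are examined. As a minimal, non-authoritative sketch, the following implements one common representative, speed perturbation, by resampling the waveform with linear interpolation; the function name, the interpolation scheme, and the factor set {0.9, 1.0, 1.1} are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def speed_perturb(wave: torch.Tensor, factor: float) -> torch.Tensor:
    """Resample a 1-D waveform by `factor`, so it plays back faster
    (factor > 1) or slower (factor < 1) at the original sample rate.
    Linear interpolation is a rough stand-in for proper polyphase
    resampling; this is a sketch, not the paper's implementation."""
    new_len = int(round(wave.shape[-1] / factor))
    # F.interpolate expects (batch, channels, time)
    out = F.interpolate(wave.view(1, 1, -1), size=new_len,
                        mode="linear", align_corners=False)
    return out.view(-1)

# Example: draw one factor per utterance, as in common speed-perturbation
# setups (the factor set here is an assumption).
factor = [0.9, 1.0, 1.1][int(torch.randint(0, 3, (1,)))]
perturbed = speed_perturb(torch.randn(16000), factor)
```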
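To make the proposed modification concrete, below is a minimal sketch of one plausible reading of "masking in the STFT domain": the waveform is transformed with torch.stft, random frequency and time bands of the complex spectrogram are zeroed (SpecAugment-style), and the signal is resynthesized with torch.istft so that a learnable front-end still consumes raw audio. This reading and every hyper-parameter below are assumptions; the paper's exact procedure may differ.

```python
import torch

def stft_domain_mask(wave: torch.Tensor,
                     n_fft: int = 512,
                     hop: int = 128,
                     num_freq_masks: int = 2,
                     max_freq_width: int = 27,
                     num_time_masks: int = 2,
                     max_time_width: int = 30) -> torch.Tensor:
    """SpecAugment-style masking applied in the STFT domain (sketch).

    Transforms the waveform to a complex spectrogram, zeroes random
    frequency and time bands, and resynthesizes the signal so a
    learnable front-end still receives a raw waveform. All
    hyper-parameters are illustrative, not the paper's values.
    """
    window = torch.hann_window(n_fft)
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)  # (freq, time)
    n_freq, n_time = spec.shape[-2], spec.shape[-1]

    # Zero random frequency bands.
    for _ in range(num_freq_masks):
        width = int(torch.randint(0, max_freq_width + 1, (1,)))
        start = int(torch.randint(0, max(1, n_freq - width), (1,)))
        spec[..., start:start + width, :] = 0

    # Zero random time bands.
    for _ in range(num_time_masks):
        width = int(torch.randint(0, max_time_width + 1, (1,)))
        start = int(torch.randint(0, max(1, n_time - width), (1,)))
        spec[..., :, start:start + width] = 0

    return torch.istft(spec, n_fft=n_fft, hop_length=hop,
                       window=window, length=wave.shape[-1])

# Example: mask a one-second 16 kHz waveform before the learnable front-end.
masked = stft_domain_mask(torch.randn(16000))
```

Resynthesizing after masking, rather than masking the front-end's output features, is what lets the augmentation act before a waveform-level learnable front-end; the overlap-add in istft will slightly smear the masked regions, which is acceptable for an illustration.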