Most current speech data augmentation methods operate on either the raw waveform or the amplitude spectrum of speech. In this paper, we propose a novel speech data augmentation method, PhasePerturbation, that operates dynamically on the phase spectrum of speech. Instead of statically rotating the phase by a constant angle, PhasePerturbation applies three dynamic phase spectrum operations, namely a randomization operation, a frequency masking operation, and a temporal masking operation, to enhance the diversity of speech data. We conduct experiments on wav2vec2.0 pre-trained ASR models by fine-tuning them on the TIMIT corpus augmented with PhasePerturbation. The experimental results demonstrate a 10.9\% relative reduction in word error rate (WER) compared with the baseline model fine-tuned without any augmentation. Furthermore, the proposed method achieves additional WER improvements of 12.9\% and 15.9\% when combined with Vocal Tract Length Perturbation (VTLP) and SpecAug, respectively, both of which are amplitude spectrum-based augmentation methods. These results highlight the capability of PhasePerturbation to further improve current amplitude spectrum-based augmentation methods.
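To make the three phase spectrum operations concrete, the following is a minimal sketch of how they might be applied to a complex short-time Fourier transform (STFT). The abstract does not specify how each operation is defined, so everything here is an assumption made for illustration: the function name `perturb_phase`, the uniform random phase offset, the choice to zero the phase inside masked regions, and all parameter names and default values are hypothetical and are not taken from the paper.

```python
import numpy as np


def perturb_phase(spec, rng, max_shift=np.pi / 4,
                  max_freq_width=8, max_time_width=10):
    """Perturb only the phase of a complex STFT of shape (freq_bins, frames).

    All operation definitions and defaults are illustrative assumptions.
    """
    magnitude = np.abs(spec)   # amplitude spectrum, left untouched
    phase = np.angle(spec)     # phase spectrum in radians
    n_freq, n_frames = phase.shape

    # 1) Randomization (assumed form): add an independent uniform offset
    #    in [-max_shift, max_shift] to every time-frequency bin.
    phase = phase + rng.uniform(-max_shift, max_shift, size=phase.shape)

    # 2) Frequency masking (assumed form): zero the phase in a random
    #    band of consecutive frequency bins.
    f_width = rng.integers(0, max_freq_width + 1)
    f_start = rng.integers(0, max(1, n_freq - f_width + 1))
    phase[f_start:f_start + f_width, :] = 0.0

    # 3) Temporal masking (assumed form): zero the phase in a random
    #    span of consecutive frames.
    t_width = rng.integers(0, max_time_width + 1)
    t_start = rng.integers(0, max(1, n_frames - t_width + 1))
    phase[:, t_start:t_start + t_width] = 0.0

    # Recombine the original magnitude with the perturbed phase.
    return magnitude * np.exp(1j * phase)
```

Because a fresh offset and fresh mask positions are drawn on every call, each pass over the training data would see a different phase spectrum for the same utterance, while the amplitude spectrum stays unchanged. That separation is one way to read why the method could be stacked with amplitude-based augmentations such as VTLP or SpecAug; an inverse STFT would then be needed to obtain a waveform for a model like wav2vec2.0.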