Data augmentation is known to improve robustness in speech-processing tasks. In this study, we summarize and compare different data augmentation strategies using the S3PRL toolkit. We examine how HuBERT and wav2vec perform under three augmentation techniques (SpecAugment, Gaussian Noise, and Speed Perturbation) on Phoneme Recognition (PR) and Automatic Speech Recognition (ASR) tasks. We evaluate model performance in terms of phoneme error rate (PER) for PR and word error rate (WER) for ASR. In our experiments, SpecAugment slightly improves the performance of HuBERT and wav2vec on the original dataset. We also show that models trained on data augmented with Gaussian Noise and Speed Perturbation are more robust when evaluated on augmented test sets.
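To make two of the waveform-level augmentations and the error-rate metrics concrete, the following is a minimal NumPy sketch; it is not the S3PRL implementation. The function names, the SNR-based noise level, and the linear-interpolation resampling are illustrative assumptions, since the abstract does not state the noise levels, speed factors, or resampling method used in the study.

```python
import numpy as np


def add_gaussian_noise(wave, snr_db, rng):
    """Add white Gaussian noise at roughly the requested SNR (in dB).

    Assumption: the noise level is set via a target SNR; the study's
    actual noise setting is not given in the abstract.
    """
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise


def speed_perturb(wave, factor):
    """Change playback speed by resampling with linear interpolation.

    factor > 1 shortens the signal (faster), factor < 1 lengthens it.
    This is a crude stand-in for a proper band-limited resampler.
    """
    new_len = int(round(len(wave) / factor))
    old_idx = np.arange(len(wave))
    new_idx = np.linspace(0, len(wave) - 1, new_len)
    return np.interp(new_idx, old_idx, wave)


def error_rate(ref, hyp):
    """Edit distance between two token lists, divided by the reference length.

    With word tokens this is WER; with phoneme tokens it is PER.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

For example, `error_rate("the cat sat".split(), "the cat sit".split())` counts one substitution against three reference words, giving a WER of 1/3. SpecAugment is omitted here because it operates on spectrogram features rather than on the raw waveform.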