This paper investigates unsupervised text-to-speech synthesis (TTS) as a data augmentation method for improving accented speech recognition. The TTS systems are trained on a small amount of accented speech together with its pseudo-labels rather than manual transcriptions, and are therefore unsupervised. This approach makes it possible to use untranscribed accented speech data for data augmentation in accented speech recognition. Synthetic accented speech, generated from text prompts with these TTS systems, is then combined with the available non-accented speech data to train automatic speech recognition (ASR) systems. The ASR experiments follow a self-supervised learning framework, using a Wav2vec2.0 model pre-trained on a large amount of unlabeled accented speech data. The accented speech used to train the unsupervised TTS systems is read speech selected from the L2-ARCTIC and British Isles corpora, while spontaneous conversational speech from the Edinburgh International Accents of English corpus is used as the evaluation data. Experimental results show that Wav2vec2.0 models fine-tuned for the downstream ASR task with synthetic accented speech generated by the unsupervised TTS yield up to 6.1% relative word error rate reduction compared to a Wav2vec2.0 baseline fine-tuned only on non-accented speech data from the LibriSpeech corpus.
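As a minimal illustration of the pseudo-labeling step that makes the TTS training unsupervised, the sketch below transcribes accented utterances with a pre-trained CTC-based Wav2vec2.0 model; the resulting text stands in for manual transcriptions when pairing audio with labels for TTS training. The checkpoint name and the `pseudo_label` helper are hypothetical (the paper does not specify which model produces the pseudo-labels), and only standard HuggingFace `transformers` and `torchaudio` calls are used.

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Hypothetical choice of labeling model; any pre-trained CTC ASR
# checkpoint could play this role.
MODEL_ID = "facebook/wav2vec2-base-960h"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).eval()

def pseudo_label(wav_path: str) -> str:
    """Transcribe one accented utterance; the output replaces a manual
    transcription when building the unsupervised TTS training set."""
    waveform, sr = torchaudio.load(wav_path)
    if sr != 16_000:  # Wav2vec2.0 expects 16 kHz input
        waveform = torchaudio.functional.resample(waveform, sr, 16_000)
    inputs = processor(waveform.squeeze(0), sampling_rate=16_000,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)  # greedy CTC decoding
    return processor.batch_decode(ids)[0]

# Pairs of (pseudo_label(path), path) then form the training data for
# one accent-specific TTS system.
```

The synthetic speech produced by such a TTS system would then be mixed with the non-accented LibriSpeech data before fine-tuning the Wav2vec2.0 ASR model, per the setup described above.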