Synthetic data generated by text-to-speech (TTS) systems can be used to improve automatic speech recognition (ASR) systems in low-resource or domain-mismatch tasks. However, it has been shown that TTS-generated outputs still do not have the same qualities as real data. In this work, we focus on the temporal structure of synthetic data and its relation to ASR training. Using a novel oracle setup, we show how much the degradation of synthetic data quality is influenced by duration modeling in non-autoregressive (NAR) TTS. To obtain reference phoneme durations, we use two common alignment methods: a hidden Markov model with Gaussian mixture emissions (HMM-GMM) aligner and a neural connectionist temporal classification (CTC) aligner. Using a simple algorithm based on random walks, we shift the phoneme duration distributions of the TTS system closer to real durations, which improves an ASR system that uses synthetic data in a semi-supervised setting.
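To illustrate the general idea of perturbing durations with a random walk, the following is a minimal sketch and not the algorithm used in this work, whose details are not given in the abstract. It assumes integer frame durations per phoneme and, at each step, moves a single frame between two randomly chosen neighbouring phonemes, so the total utterance length stays fixed while individual durations spread out. The function name `random_walk_durations` and the parameters `num_steps`, `min_frames`, and `seed` are hypothetical.

```python
import random


def random_walk_durations(durations, num_steps, min_frames=1, seed=None):
    """Perturb per-phoneme frame durations with a simple random walk.

    Illustrative assumption only: each step picks a random pair of
    neighbouring phonemes and moves one frame from one to the other.
    The total number of frames is preserved, and no phoneme drops
    below `min_frames` (inputs are assumed to satisfy this already).
    """
    rng = random.Random(seed)
    result = list(durations)
    if len(result) < 2:
        # Nothing to exchange with a single phoneme (or an empty sequence).
        return result
    for _ in range(num_steps):
        i = rng.randrange(len(result) - 1)  # index of the left neighbour
        # Choose the direction of the one-frame transfer at random.
        donor, receiver = (i, i + 1) if rng.random() < 0.5 else (i + 1, i)
        if result[donor] > min_frames:
            result[donor] -= 1
            result[receiver] += 1
    return result


if __name__ == "__main__":
    predicted = [5, 3, 8, 2, 6]  # made-up frame counts for five phonemes
    print(random_walk_durations(predicted, num_steps=20, seed=0))
```

In this sketch, a larger `num_steps` moves the durations further from the TTS prediction. How the amount of perturbation would be matched to the durations obtained from the HMM-GMM or CTC aligner is left open here.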