This paper presents a method for selecting appropriate synthetic speech samples from a given large text-to-speech (TTS) dataset as supplementary training data for an automatic speech recognition (ASR) model. We trained a neural network, which can be optimised using cross-entropy loss or ArcFace loss, to measure the similarity of synthetic data to real speech. We found that incorporating synthetic samples with considerable dissimilarity to real speech, owing in part to lexical differences, into ASR training is crucial for boosting recognition performance. Experimental results on the LibriSpeech test sets indicate that, to maintain the same speech recognition accuracy as when using all TTS data, our proposed solution can reduce the TTS data to below $30\,\%$ of its original size, outperforming several baseline methods.
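The abstract names ArcFace loss as one possible training objective for the similarity network but does not give its form. As a minimal sketch, the standard additive angular margin loss for a single example can be written in plain Python as below. The function name, the two-class (real vs. synthetic) setup in the usage note, and the default `scale` and `margin` values are illustrative assumptions, not details taken from the paper.

```python
import math


def arcface_loss(embedding, class_weights, target, scale=30.0, margin=0.5):
    """Additive angular margin (ArcFace-style) loss for one example.

    embedding:     list of floats, the network's output vector
    class_weights: list of per-class weight vectors (same length as embedding)
    target:        index of the true class
    scale, margin: hypothetical defaults, not values from the paper

    Simplified: the usual safeguard for angles where theta + margin > pi
    is omitted.
    """
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    e = unit(embedding)
    logits = []
    for j, w in enumerate(class_weights):
        # Cosine of the angle between the embedding and class j's weight vector.
        cos_theta = sum(a * b for a, b in zip(e, unit(w)))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard acos against rounding
        if j == target:
            # Penalise the true class by widening its angle by `margin`.
            cos_theta = math.cos(math.acos(cos_theta) + margin)
        logits.append(scale * cos_theta)

    # Cross-entropy on the margin-adjusted logits (stable log-sum-exp).
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]
```

With `margin=0.0` this reduces to ordinary cross-entropy over scaled cosine logits, which is how the two objectives mentioned in the abstract relate. For example, `arcface_loss([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], target=0)` scores one embedding against two hypothetical class vectors, one for real speech and one for synthetic speech.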