Training a high-performance end-to-end (E2E) speech processing model requires an enormous amount of labeled speech data, especially in the era of data-centric artificial intelligence. However, labeled speech data are usually scarcer and more expensive to collect than textual data. We propose Latent Synthesis (LaSyn), an efficient framework for utilizing textual data in E2E speech processing models. We train a latent synthesizer to convert textual data into an intermediate latent representation of a pre-trained speech model. These pseudo-acoustic representations of textual data augment the acoustic data used for model training. We evaluate LaSyn on low-resource automatic speech recognition (ASR) and spoken language understanding (SLU) tasks. For ASR, LaSyn improves an E2E baseline trained on LibriSpeech train-clean-100, with relative word error rate reductions of over 22.3% on different test sets. For SLU, LaSyn improves our E2E baseline by an absolute 4.1% in intent classification accuracy and 3.8% in slot filling SLU-F1 on SLURP, and by an absolute 4.49% and 2.25% in exact match (EM) and EM-Tree accuracies on STOP, respectively. With fewer parameters, the results of LaSyn are competitive with published state-of-the-art works. The results demonstrate the quality of the augmented training data. The source code will be made available to the community.
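The data flow described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `toy_latent_synthesizer`, `toy_speech_encoder`, and `build_training_pool` are hypothetical, and both toy functions are simple placeholders for what would in practice be trained neural networks. The sketch only shows how text-derived pseudo-acoustic latents and speech-derived latents could share one training pool.

```python
import random
from typing import List, Sequence, Tuple

Latent = List[List[float]]    # a sequence of frames, each a vector of floats
Example = Tuple[Latent, str]  # (latent sequence, training target)


def toy_latent_synthesizer(text: str, dim: int = 4) -> Latent:
    """Hypothetical placeholder for a trained latent synthesizer.

    Emits one deterministic pseudo-random frame per character, so that
    text is mapped into the same latent space as encoded speech.
    """
    frames = []
    for ch in text:
        rng = random.Random(ord(ch))  # seeded per character for determinism
        frames.append([rng.uniform(-1.0, 1.0) for _ in range(dim)])
    return frames


def toy_speech_encoder(audio: Sequence[float], dim: int = 4) -> Latent:
    """Hypothetical placeholder for the lower layers of a pre-trained speech model.

    Chunks raw samples into non-overlapping frames of size `dim`.
    """
    return [list(audio[i:i + dim]) for i in range(0, len(audio) - dim + 1, dim)]


def build_training_pool(
    speech_data: Sequence[Tuple[Sequence[float], str]],
    text_data: Sequence[Tuple[str, str]],
    dim: int = 4,
) -> List[Example]:
    """Combine real acoustic latents with pseudo-acoustic latents from text."""
    pool = [(toy_speech_encoder(audio, dim), target) for audio, target in speech_data]
    pool += [(toy_latent_synthesizer(text, dim), target) for text, target in text_data]
    return pool
```

In this sketch, the downstream model would consume every item in the pool in the same way, regardless of whether its latent sequence came from audio or from text. For an ASR-style task the target of a text example would simply be the text itself; for an SLU-style task it would be a semantic label.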