Unsupervised automatic speech recognition (ASR) aims to learn the mapping between a speech signal and its textual transcription without supervision from paired speech-text data. Each word or phoneme corresponds to a segment of the speech signal whose length varies and whose boundaries are unknown, and this segmental structure makes learning the mapping between speech and text challenging, especially without paired data. In this paper, we propose REBORN, Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR. REBORN alternates between (1) training a segmentation model that predicts the boundaries of the segmental structures in speech signals and (2) training a phoneme prediction model that takes the speech features, segmented by the segmentation model, as input and predicts a phoneme transcription. Since no supervised data is available for training the segmentation model, we train it with reinforcement learning to favor segmentations that yield phoneme sequence predictions with lower perplexity. We conduct extensive experiments and find that, under the same setting, REBORN outperforms all prior unsupervised ASR models on LibriSpeech, TIMIT, and five non-English languages in Multilingual LibriSpeech. We also comprehensively analyze why the boundaries learned by REBORN improve unsupervised ASR performance.
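The idea of rewarding segmentations by perplexity can be illustrated with a toy sketch. This is not the authors' implementation: the bigram language model, the frame-level phoneme labels standing in for the phoneme predictor's outputs, the majority-vote pooling, and every function name here (`train_bigram_lm`, `segment_to_phonemes`, `reward`, `reinforce_update`) are assumptions made purely for illustration. The reward is the negative perplexity of the phoneme sequence produced by a candidate segmentation, and the update is a plain REINFORCE step over independent per-frame boundary decisions.

```python
import math
from collections import Counter


def train_bigram_lm(corpus, vocab_size):
    """Return an add-one-smoothed bigram probability function (toy phoneme LM)."""
    prev_counts, pair_counts = Counter(), Counter()
    for seq in corpus:
        padded = ["<s>"] + list(seq)
        for prev, cur in zip(padded, padded[1:]):
            prev_counts[prev] += 1
            pair_counts[(prev, cur)] += 1

    def prob(prev, cur):
        return (pair_counts[(prev, cur)] + 1) / (prev_counts[prev] + vocab_size)

    return prob


def perplexity(prob, seq):
    """Per-phoneme perplexity of a sequence under the bigram LM."""
    padded = ["<s>"] + list(seq)
    log_p = sum(math.log(prob(p, c)) for p, c in zip(padded, padded[1:]))
    return math.exp(-log_p / len(seq))


def segment_to_phonemes(frame_labels, boundaries):
    """Start a new segment where boundaries[i] == 1; emit each segment's majority label."""
    segments, current = [], []
    for label, b in zip(frame_labels, boundaries):
        if b == 1 and current:
            segments.append(current)
            current = []
        current.append(label)
    if current:
        segments.append(current)
    return [Counter(seg).most_common(1)[0][0] for seg in segments]


def reward(prob, frame_labels, boundaries):
    """Assumed reward: negative perplexity of the segmented phoneme sequence."""
    return -perplexity(prob, segment_to_phonemes(frame_labels, boundaries))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reinforce_update(logits, sampled, advantage, lr=0.1):
    """One REINFORCE step for independent Bernoulli boundary decisions.

    The gradient of log p(b) with respect to a logit z is b - sigmoid(z).
    """
    return [z + lr * advantage * (b - sigmoid(z)) for z, b in zip(logits, sampled)]


# Hypothetical toy data: unpaired phoneme text and frame-level labels for one utterance.
CORPUS = [["k", "ae", "t"], ["d", "ao", "g"], ["k", "ae", "t"]]
LM = train_bigram_lm(CORPUS, vocab_size=6)
FRAMES = ["k", "k", "ae", "ae", "ae", "t", "t"]
GOOD = [1, 0, 1, 0, 0, 1, 0]  # one boundary per phoneme change
OVER = [1, 1, 1, 1, 1, 1, 1]  # a boundary at every frame

if __name__ == "__main__":
    print("good segmentation reward:", reward(LM, FRAMES, GOOD))
    print("over-segmented reward:   ", reward(LM, FRAMES, OVER))
```

In this toy setting, the over-segmented hypothesis repeats phonemes that the text-only language model rarely sees back to back, so it receives a lower reward than the segmentation aligned with phoneme changes. A real system would replace the frame labels with a learned phoneme predictor and the bigram model with a stronger phoneme language model.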