Unsupervised automatic speech recognition (ASR) aims to learn the mapping between a speech signal and its textual transcription without supervision from paired speech-text data. Each word or phoneme corresponds to a variable-length segment of the speech signal whose boundaries are unknown, and this segmental structure makes learning the speech-to-text mapping challenging, especially without paired data. In this paper, we propose REBORN, Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR. REBORN alternates between (1) training a segmentation model that predicts the boundaries of the segmental structures in speech signals and (2) training a phoneme prediction model that takes the speech segmented by the segmentation model as input and predicts a phoneme transcription. Since no supervised data is available for training the segmentation model, we use reinforcement learning to train it to favor segmentations that yield phoneme sequence predictions with lower perplexity. We conduct extensive experiments and find that, under the same setting, REBORN outperforms all prior unsupervised ASR models on LibriSpeech, TIMIT, and five non-English languages in Multilingual LibriSpeech. We also comprehensively analyze why the boundaries learned by REBORN improve unsupervised ASR performance.
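To make the perplexity-based training signal concrete, the following is a minimal sketch of how a reward could favor segmentations whose phoneme predictions have lower perplexity. It is not the reward used in the paper: the add-one-smoothed bigram language model, the function names (`train_bigram`, `perplexity`, `reward`), the toy phoneme inventory, and the choice of negative log-perplexity as the reward are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_bigram(sequences, vocab):
    """Count bigrams over phoneme sequences (toy stand-in for a phoneme LM)."""
    unigrams, bigrams = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] + list(seq)
        for prev, cur in zip(padded, padded[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    return unigrams, bigrams, len(vocab)


def perplexity(seq, model):
    """Perplexity of a phoneme sequence under an add-one-smoothed bigram LM."""
    if not seq:
        raise ValueError("phoneme sequence must be non-empty")
    unigrams, bigrams, vocab_size = model
    padded = ["<s>"] + list(seq)
    log_prob = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(seq))


def reward(seq, model):
    """Hypothetical reward: higher when the predicted sequence has lower perplexity."""
    return -math.log(perplexity(seq, model))


# Toy unpaired "text" data and phoneme inventory (assumed for illustration).
vocab = ["k", "ae", "t", "d", "ao", "g"]
model = train_bigram([["k", "ae", "t"], ["d", "ao", "g"], ["k", "ae", "t"]], vocab)

# Phoneme predictions obtained from two candidate segmentations of one utterance.
good = ["k", "ae", "t"]             # boundaries aligned with phoneme changes
bad = ["k", "k", "ae", "ae", "t"]   # over-segmented, so phonemes repeat

print(reward(good, model) > reward(bad, model))
```

In this toy setting, the over-segmented candidate produces repeated phonemes that the language model finds unlikely, so it receives the lower reward. A policy-gradient update would then shift the segmentation model toward boundaries like those behind `good`.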