Unsupervised speech recognition (ASR-U) is the problem of learning automatic speech recognition (ASR) systems from unpaired speech-only and text-only corpora. While various algorithms exist to solve this problem, a theoretical framework for studying their properties and for addressing issues such as sensitivity to hyperparameters and training instability is missing. In this paper, we propose a general theoretical framework for studying the properties of ASR-U systems, based on random matrix theory and the theory of neural tangent kernels. This framework allows us to prove various learnability conditions and sample complexity bounds for ASR-U. Extensive ASR-U experiments on synthetic languages with three classes of transition graphs provide strong empirical evidence for our theory (code available at cactuswiththoughts/UnsupASRTheory.git).