Training unsupervised automatic speech recognition (ASR) systems is difficult because of the instability associated with GANs, misalignment between speech and text, and high memory demands. To address these challenges, we introduce ESPUM, a novel unsupervised ASR system that combines lower-order N-skipgrams (up to N=3) with positional unigram statistics gathered from a small batch of samples. Evaluated on the TIMIT benchmark, our model achieves competitive performance on ASR and phoneme segmentation. Our code is publicly available at https://github.com/lwang114/GraphUnsupASR.
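To make the two kinds of statistics named in the abstract concrete, here is a minimal sketch of how they might be counted over a small batch of phoneme sequences. It is not taken from the ESPUM code base. It assumes that a skipgram with skip distance k is a pair of tokens that occur k positions apart, and that a positional unigram is a token paired with its index in the sequence. The function names `skipgram_counts` and `positional_unigram_counts`, the parameter `max_skip`, and the toy batch are all hypothetical.

```python
from collections import Counter

# Illustrative sketch only; the definitions below are assumptions,
# not the ESPUM implementation.


def skipgram_counts(sequences, max_skip=3):
    """Count token pairs (x[t], x[t + k]) for each skip distance k = 1..max_skip."""
    counts = {k: Counter() for k in range(1, max_skip + 1)}
    for seq in sequences:
        for k in range(1, max_skip + 1):
            # Pairs exist only while t + k stays inside the sequence.
            for t in range(len(seq) - k):
                counts[k][(seq[t], seq[t + k])] += 1
    return counts


def positional_unigram_counts(sequences):
    """Count how often each token occurs at each position t."""
    counts = Counter()
    for seq in sequences:
        for t, token in enumerate(seq):
            counts[(t, token)] += 1
    return counts


# Hypothetical toy batch of two phoneme sequences.
batch = [
    ["sil", "k", "ae", "t", "sil"],
    ["sil", "b", "ae", "t", "sil"],
]

skip = skipgram_counts(batch, max_skip=3)
pos = positional_unigram_counts(batch)
```

Under these assumptions, the number of distinct skipgram entries is bounded by the square of the vocabulary size for each skip distance, regardless of sequence length. This is one plausible reason such low-order statistics could keep memory use modest.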