In this paper, we present a novel approach to text-independent phone-to-audio alignment based on phoneme recognition, representation learning and knowledge transfer. Our method combines three components: a self-supervised model (wav2vec2) fine-tuned for phoneme recognition with a Connectionist Temporal Classification (CTC) loss, a dimensionality reduction model, and a frame-level phoneme classifier trained on forced-alignment labels obtained with the Montreal Forced Aligner. Together they produce multilingual phonetic representations, so the method requires minimal additional training. We evaluate our model using synthetic native data from the TIMIT dataset and the SCRIBE dataset for American and British English, respectively. Our proposed model outperforms the state-of-the-art system (charsiu) in statistical metrics and has applications in language learning and speech processing systems. We leave experiments on other languages for future work, but the design of the system makes it readily adaptable to them.
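To illustrate only the last stage of such a pipeline, the following minimal sketch shows one simple way to turn frame-level phoneme predictions into a phone-to-audio alignment: consecutive frames with the same label are merged into a single timed segment. It is an assumption-laden illustration rather than the system evaluated here. The function name `frames_to_segments`, the example labels, and the 20 ms frame hop (the stride of the standard wav2vec2 feature encoder on 16 kHz audio) are all assumptions introduced for this example.

```python
from itertools import groupby


def frames_to_segments(frame_labels, frame_sec=0.02, skip_label=None):
    """Merge runs of identical frame labels into (phone, start, end) segments.

    frame_labels: one predicted phoneme label per frame (hypothetical input).
    frame_sec:    assumed frame hop in seconds (20 ms is an assumption).
    skip_label:   optional label to drop from the output (e.g. a blank symbol).
    """
    segments = []
    frame_idx = 0
    for label, run in groupby(frame_labels):
        run_len = len(list(run))
        if label != skip_label:
            start = round(frame_idx * frame_sec, 3)
            end = round((frame_idx + run_len) * frame_sec, 3)
            segments.append((label, start, end))
        # Advance the frame counter even when the run is skipped,
        # so later segments keep their correct timestamps.
        frame_idx += run_len
    return segments


# Hypothetical per-frame predictions for a short utterance.
example_frames = ["sil", "sil", "k", "k", "k", "ae", "ae", "t"]
example_segments = frames_to_segments(example_frames)
for phone, start, end in example_segments:
    print(f"{phone}\t{start:.2f}\t{end:.2f}")
```

Because the segmentation depends only on the per-frame labels, no transcript is needed at this stage, which is what makes an alignment of this kind text-independent.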