Multilingual automatic speech recognition (ASR) systems have garnered attention for their potential to extend language coverage globally. While self-supervised learning (SSL) models such as MMS have demonstrated their effectiveness in multilingual ASR, the representations at different layers potentially contain distinct information that has not been fully leveraged. In this study, we propose a novel method that leverages self-supervised hierarchical representations (SSHR) to fine-tune the MMS model. We first analyze the different layers of MMS and show that the middle layers capture language-related information and the high layers encode content-related information, which gradually decreases in the final layers. We then extract language-related frames from correlated middle layers and guide specific language extraction through self-attention mechanisms. Additionally, we steer the model toward acquiring more content-related information in the final layers using our proposed Cross-CTC. We evaluate SSHR on two multilingual datasets, Common Voice and ML-SUPERB, and the experimental results demonstrate that our method achieves state-of-the-art performance.
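The two ideas sketched in the abstract can be illustrated with a toy example: pooling correlated middle layers into a language-related frame, and mixing an intermediate (cross) CTC objective with the final-layer objective. This is a minimal NumPy sketch under assumed shapes and weights, not the paper's implementation; the layer indices (10-14, 20) and the 0.3 interpolation weight are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for per-layer MMS encoder outputs:
# 24 layers, 50 frames, feature dim 8.
num_layers, T, d = 24, 50, 8
hidden = rng.normal(size=(num_layers, T, d))

# (1) Average correlated middle layers (10-14 here, an assumption),
# then mean-pool over time into one language-related frame vector.
middle = hidden[10:15].mean(axis=0)   # (T, d)
lang_frame = middle.mean(axis=0)      # (d,)

# (2) Cross-CTC-style multi-task objective: interpolate a loss computed
# on an intermediate layer with the final-layer loss.
def toy_loss(feats):
    # Placeholder for a real CTC loss over frame features.
    return float(np.mean(feats ** 2))

inter_loss = toy_loss(hidden[20])
final_loss = toy_loss(hidden[-1])
total_loss = 0.7 * final_loss + 0.3 * inter_loss  # 0.3 weight is illustrative

print(lang_frame.shape, total_loss)
```

In a real fine-tuning setup, `toy_loss` would be a CTC loss over frame-level logits, and the language-related frame would condition the upper layers' self-attention rather than being computed offline.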