Multilingual automatic speech recognition (ASR) systems have attracted attention for their potential to extend language coverage worldwide. Although self-supervised learning (SSL) has proven effective for multilingual ASR, the representations at different layers of an SSL model carry distinct information that has not been fully exploited. In this study, we propose a novel method that leverages self-supervised hierarchical representations (SSHR) to fine-tune multilingual ASR. We first analyze the layers of the SSL model for language-related and content-related information, identifying the layers that correlate most strongly with each. We then extract a language-related frame from the correlated middle layers and use it to guide content extraction through self-attention mechanisms. Additionally, we steer the model toward acquiring more content-related information in the final layers using our proposed Cross-CTC. We evaluate SSHR on two multilingual datasets, Common Voice and ML-SUPERB, and the experimental results show that, to the best of our knowledge, our method achieves state-of-the-art performance.
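The layer-wise analysis mentioned above can be pictured as a probing experiment: pool each layer's frame features into one vector per utterance, fit a simple language classifier per layer, and compare accuracies. The abstract does not specify the probing procedure, so the sketch below is only an illustration under stated assumptions. The nearest-centroid probe, the function names, and the toy features are all hypothetical and are not the authors' method.

```python
from collections import defaultdict
from typing import List, Sequence

Vector = Sequence[float]


def mean_pool(vectors: Sequence[Vector]) -> List[float]:
    """Average a set of equal-length vectors (frames or utterances)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def centroid_probe_accuracy(train_x, train_y, test_x, test_y) -> float:
    """Nearest-centroid probe: a cheap stand-in for a trained linear probe."""
    groups = defaultdict(list)
    for x, y in zip(train_x, train_y):
        groups[y].append(x)
    centroids = {y: mean_pool(xs) for y, xs in groups.items()}

    def predict(x: Vector) -> str:
        # Pick the language whose centroid is closest in squared distance.
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    correct = sum(predict(x) == y for x, y in zip(test_x, test_y))
    return correct / len(test_y)


def probe_layers(train_layers, train_y, test_layers, test_y) -> List[float]:
    """Return one probe accuracy per layer."""
    return [
        centroid_probe_accuracy(tr, train_y, te, test_y)
        for tr, te in zip(train_layers, test_layers)
    ]


# Toy utterance-level features (assumed, hand-made): only the middle layer
# separates the two languages; the first and last layers are uninformative.
flat = [[0.5, 0.5]] * 4
train_layers = [
    flat,
    [[0.0, 0.1], [1.0, 0.9], [0.1, 0.0], [0.9, 1.0]],
    flat,
]
train_y = ["en", "zh", "en", "zh"]
test_layers = [
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.2, 0.1], [0.8, 0.9]],
    [[0.5, 0.5], [0.5, 0.5]],
]
test_y = ["en", "zh"]

accs = probe_layers(train_layers, train_y, test_layers, test_y)
best_layer = max(range(len(accs)), key=accs.__getitem__)
```

With real SSL features, each entry of `train_layers` would hold mean-pooled hidden states from one Transformer layer, and the layer with the highest probe accuracy would be a candidate source for a language-related representation.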