Data2vec is a self-supervised learning (SSL) approach that employs a teacher-student architecture for contextual representation learning via masked prediction, demonstrating remarkable performance in monolingual ASR. Previous studies have revealed that data2vec's shallow layers capture speaker and language information, middle layers encode phoneme and word features, and deep layers are responsible for reconstruction. Language and phoneme features are crucial for multilingual ASR. However, data2vec's masked representation generation relies on multi-layer averaging, which inevitably couples these features. To address this limitation, we propose a decoupling-quantization-based data2vec (DQ-Data2vec) for multilingual ASR, which comprises a data2vec backbone and two improved online K-means quantizers. Our core idea is to use K-means quantizers with specified cluster numbers to decouple language and phoneme information for masked prediction. Specifically, in the language quantization, considering that the number of languages differs markedly from the number of other, irrelevant features (e.g., speakers), we set the cluster number to the number of languages, explicitly decoupling the shallow layers' language-related information from irrelevant features. The same strategy is applied to decouple the middle layers' phoneme and word features. In a self-supervised scenario, experiments on the CommonVoice dataset demonstrate that DQ-Data2vec achieves a relative reduction of 9.51% in phoneme error rate (PER) and 11.58% in word error rate (WER) compared to data2vec and UniData2vec. Moreover, in a weakly-supervised scenario incorporating language labels and high-resource language text labels, the relative reductions are 18.09% and 1.55%, respectively.
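The decoupling idea above hinges on an online K-means quantizer whose cluster count is tied to the size of the target label space (e.g., one cluster per language for the shallow layers). The following is a minimal sketch of such a quantizer, assuming nearest-centroid assignment with exponential-moving-average centroid updates; the class name, cluster counts, feature dimension, and decay rate are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

class OnlineKMeansQuantizer:
    """Sketch of an online K-means quantizer: each frame vector is
    assigned to its nearest centroid, and centroids are refreshed with
    an exponential moving average (EMA) over the assigned frames."""

    def __init__(self, num_clusters, dim, decay=0.99, seed=0):
        rng = np.random.default_rng(seed)
        # Random initial codebook; a real system would warm-start this.
        self.centroids = rng.standard_normal((num_clusters, dim))
        self.decay = decay

    def assign(self, frames):
        # frames: (T, dim) layer-averaged representations.
        # Squared Euclidean distance to every centroid -> nearest index.
        d = ((frames[:, None, :] - self.centroids[None, :, :]) ** 2).sum(-1)
        return d.argmin(axis=1)

    def update(self, frames):
        # One online step: assign frames, then EMA-update each used centroid.
        ids = self.assign(frames)
        for k in np.unique(ids):
            mean_k = frames[ids == k].mean(axis=0)
            self.centroids[k] = (self.decay * self.centroids[k]
                                 + (1.0 - self.decay) * mean_k)
        return ids

# Two quantizers whose cluster counts match their intended label spaces:
# a small one for languages (shallow layers) and a larger one for
# phoneme-like units (middle layers). Both counts are hypothetical.
lang_q = OnlineKMeansQuantizer(num_clusters=8, dim=16)
phone_q = OnlineKMeansQuantizer(num_clusters=64, dim=16)

x = np.random.default_rng(1).standard_normal((100, 16))  # dummy frames
lang_ids = lang_q.update(x)    # discrete language-cluster targets
phone_ids = phone_q.update(x)  # discrete phoneme-cluster targets
```

The discrete indices produced this way can then serve as masked-prediction targets, so the language head and the phoneme head each learn against a codebook whose granularity matches the factor being decoupled.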