Popular frameworks for self-supervised learning of speech representations have largely focused on frame-level masked prediction of speech regions. While this approach has shown promising downstream performance for speech recognition and related tasks, it has largely ignored factors of speech that are encoded at a coarser level, such as characteristics of the speaker or channel that remain consistent throughout an utterance. In this work, we propose a framework for Learning Disentangled Self-Supervised (termed Learn2Diss) representations of speech, which consists of a frame-level encoder module and an utterance-level encoder module. The two encoders are initially learned independently: the frame-level model is largely inspired by existing self-supervision techniques and thereby learns pseudo-phonemic representations, while the utterance-level encoder is inspired by contrastive learning of pooled embeddings and thereby learns pseudo-speaker representations. The joint learning of the two modules then disentangles the two encoders using a mutual-information-based criterion. Through several downstream evaluation experiments, we show that the proposed Learn2Diss achieves state-of-the-art results on a variety of tasks, with the frame-level encoder representations improving semantic tasks and the utterance-level representations improving non-semantic tasks.
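To make the disentanglement criterion concrete, below is a minimal sketch of a mutual-information penalty between the two encoders' embeddings. The abstract does not specify the exact estimator, so this sketch assumes a CLUB-style variational upper bound (Cheng et al., 2020); all names here (MIUpperBound, frame_emb, utt_emb) are illustrative and not part of the Learn2Diss codebase.

```python
# Hedged sketch: CLUB-style upper bound on I(X; Y), minimized to
# disentangle frame-level (pseudo-phonemic) from utterance-level
# (pseudo-speaker) embeddings. This is an assumed estimator, not
# necessarily the one used in the paper.
import torch
import torch.nn as nn

class MIUpperBound(nn.Module):
    """Upper-bounds I(X; Y) via a learned variational approximation
    q(y|x), parameterized as a diagonal Gaussian."""
    def __init__(self, x_dim: int, y_dim: int, hidden: int = 256):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, y_dim))
        self.logvar = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, y_dim), nn.Tanh())

    def loglik(self, x, y):
        # Gaussian log-likelihood of y under q(y|x), up to constants.
        mu, logvar = self.mu(x), self.logvar(x)
        return (-(y - mu) ** 2 / logvar.exp() - logvar).sum(-1)

    def forward(self, x, y):
        # Gap between paired and shuffled (marginal) log-likelihoods
        # gives the CLUB upper bound on mutual information.
        pos = self.loglik(x, y)
        neg = self.loglik(x, y[torch.randperm(y.size(0))])
        return (pos - neg).mean()

# Illustrative usage: the bound is added to each encoder's SSL loss and
# minimized w.r.t. the encoders, while q(y|x) itself is trained separately
# to maximize loglik on paired samples.
frame_emb = torch.randn(32, 512)  # pooled frame-level embeddings (batch of 32)
utt_emb = torch.randn(32, 192)    # utterance-level embeddings
mi = MIUpperBound(512, 192)
disentangle_loss = mi(frame_emb, utt_emb)
```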