Self-supervised speech representations are known to encode both speaker and phonetic information, but how these two kinds of information are distributed in the high-dimensional representation space remains largely unexplored. We hypothesize that they are encoded in orthogonal subspaces, a property that lends itself to simple disentanglement. Applying principal component analysis to the representations of two predictive coding models, we identify two subspaces that capture speaker and phonetic variance, respectively, and confirm that they are nearly orthogonal. Based on this property, we propose a new speaker normalization method that collapses the subspace encoding speaker information, without requiring transcriptions. Probing experiments show that our method effectively eliminates speaker information and outperforms a previous baseline in phone discrimination tasks. Moreover, the approach generalizes and can be used to remove speaker information for unseen speakers.
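The idea of collapsing a speaker subspace can be illustrated with a minimal sketch. It is not the procedure used in this work. It assumes that speaker directions can be estimated by running PCA on per-speaker mean vectors, and that normalization means projecting every frame onto the orthogonal complement of those directions. The function names `speaker_subspace` and `collapse_subspace`, the choice of `n_components`, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import numpy as np


def speaker_subspace(feats, speaker_ids, n_components):
    """Estimate speaker directions via PCA on per-speaker mean vectors.

    Assumption: averaging frames per speaker leaves mostly speaker variance.
    Only speaker labels are needed, no transcriptions.
    """
    speakers = np.unique(speaker_ids)
    means = np.stack([feats[speaker_ids == s].mean(axis=0) for s in speakers])
    centered = means - means.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_components]  # shape: (n_components, dim)


def collapse_subspace(feats, basis):
    """Remove the component of each frame that lies in span(basis)."""
    return feats - (feats @ basis.T) @ basis


def between_speaker_spread(feats, speaker_ids):
    """Sum of squared deviations of speaker means from their grand mean."""
    speakers = np.unique(speaker_ids)
    means = np.stack([feats[speaker_ids == s].mean(axis=0) for s in speakers])
    return float(((means - means.mean(axis=0)) ** 2).sum())


# Synthetic data (hypothetical): speaker offsets live in dimensions 0-1,
# all other variation lives in the remaining, orthogonal dimensions.
rng = np.random.default_rng(0)
dim, n_speakers, frames_per_speaker = 16, 5, 200
speaker_ids = np.repeat(np.arange(n_speakers), frames_per_speaker)
offsets = np.zeros((n_speakers, dim))
offsets[:, :2] = rng.normal(scale=5.0, size=(n_speakers, 2))
content = np.zeros((len(speaker_ids), dim))
content[:, 2:] = rng.normal(size=(len(speaker_ids), dim - 2))
feats = offsets[speaker_ids] + content

basis = speaker_subspace(feats, speaker_ids, n_components=2)
normalized = collapse_subspace(feats, basis)

print("spread before:", between_speaker_spread(feats, speaker_ids))
print("spread after: ", between_speaker_spread(normalized, speaker_ids))
```

Because `basis` is just a set of directions, the same matrix could be applied to frames from speakers that were not used to estimate it. Under the assumptions above, that is one way the generalization to unseen speakers could be exercised.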