Self-supervised speech representations are known to encode both speaker and phonetic information, but how the two are distributed in the high-dimensional space remains largely unexplored. We hypothesize that they are encoded in orthogonal subspaces, a property that lends itself to simple disentanglement. Applying principal component analysis to representations of two predictive coding models, we identify two subspaces that capture speaker and phonetic variances, and confirm that they are nearly orthogonal. Based on this property, we propose a new speaker normalization method that collapses the subspace encoding speaker information, without requiring transcriptions. Probing experiments show that our method effectively eliminates speaker information and outperforms a previous baseline in phone discrimination tasks. Moreover, the approach generalizes and can be used to remove the speaker information of unseen speakers.
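The idea of collapsing a speaker subspace can be sketched in a few lines of NumPy. The snippet below is an illustration under simplified assumptions, not the paper's exact procedure: it fabricates synthetic "frame representations" with a per-speaker offset, estimates a speaker subspace by PCA over per-speaker mean vectors, and projects every frame onto the orthogonal complement of that subspace. All names, dimensions, and the subspace rank `k` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for self-supervised frame representations:
# 4 hypothetical speakers, 50 frames each, 16-dim features,
# with speaker identity injected as a per-speaker offset.
n_speakers, frames, dim, k = 4, 50, 16, 3
speaker_offsets = rng.normal(scale=3.0, size=(n_speakers, dim))
X = np.concatenate([rng.normal(size=(frames, dim)) + off
                    for off in speaker_offsets])
labels = np.repeat(np.arange(n_speakers), frames)

# Estimate the speaker subspace via PCA on per-speaker mean vectors.
means = np.stack([X[labels == s].mean(axis=0) for s in range(n_speakers)])
centered = means - means.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
U = Vt[:k].T  # top-k directions spanning speaker variance, shape (dim, k)

# "Collapse" the speaker subspace: project each frame onto its complement.
X_norm = X - (X @ U) @ U.T

# After normalization, the per-speaker means should nearly coincide,
# i.e. the between-speaker spread collapses.
means_norm = np.stack([X_norm[labels == s].mean(axis=0)
                       for s in range(n_speakers)])
spread_before = np.linalg.norm(means - means.mean(axis=0))
spread_after = np.linalg.norm(means_norm - means_norm.mean(axis=0))
print(spread_before > 10 * spread_after)
```

Because the projection matrix is fixed once the subspace is estimated, the same `U` can be applied to representations of speakers not seen during estimation, which is the sense in which such a method can generalize.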