How do speech models trained through self-supervised learning (SSL) structure their representations? Previous studies have examined how information is encoded in feature vectors across different layers, but few have considered whether speech characteristics are captured within individual dimensions of SSL features. In this paper, we look specifically at speaker information, applying principal component analysis (PCA) to utterance-averaged representations. Using WavLM, we find that the principal dimension explaining the most variance encodes pitch and associated characteristics such as gender. Other individual principal dimensions correlate with intensity, noise levels, the second formant, and higher-frequency characteristics. Finally, synthesis experiments show that most of these characteristics can be controlled by changing the corresponding dimensions. This provides a simple method for controlling characteristics of the output voice in synthesis applications.
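The analysis described in the abstract (average frame-level features over each utterance, run PCA across utterances, then move along a principal dimension) can be illustrated with a minimal sketch. This is not the authors' pipeline: random synthetic arrays stand in for WavLM features, the 16-dimensional feature size and the "pitch-like" trait direction are assumptions made for the toy example, and helper names such as `make_utterance`, `shift_along`, and `project` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16       # toy feature size (assumption; real SSL features are much larger)
N_UTTS = 200   # number of synthetic utterances (assumption)

# Hypothetical ground-truth direction standing in for a pitch-like speaker trait.
direction = rng.normal(size=DIM)
direction /= np.linalg.norm(direction)


def make_utterance(trait):
    """Synthetic frame-level features: one speaker trait plus per-frame noise."""
    n_frames = rng.integers(50, 100)
    return trait * direction + rng.normal(size=(n_frames, DIM))


traits = rng.normal(scale=5.0, size=N_UTTS)
utterances = [make_utterance(t) for t in traits]

# Step 1: average over time so each utterance becomes a single vector.
X = np.stack([u.mean(axis=0) for u in utterances])

# Step 2: PCA via SVD of the mean-centred utterance vectors.
mean = X.mean(axis=0)
_, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
explained_ratio = S**2 / np.sum(S**2)
pc1 = Vt[0]  # unit-norm principal dimension with the largest variance


def shift_along(frames, component, delta):
    """Move every frame by `delta` along one principal dimension."""
    return frames + delta * component


def project(frames, component):
    """Utterance-level coordinate along a principal dimension."""
    return float((frames.mean(axis=0) - mean) @ component)


# Step 3: edit one utterance along PC1 and measure the change in its coordinate.
before = project(utterances[0], pc1)
after = project(shift_along(utterances[0], pc1, 3.0), pc1)
alignment = abs(float(pc1 @ direction))

print("variance explained by PC1:", explained_ratio[0])
print("|cos(PC1, planted direction)|:", alignment)
print("coordinate change after shift:", after - before)
```

In this toy setting the planted trait dominates the variance, so the first principal dimension recovers it. With real SSL features, the edited frames would additionally need to be passed to a synthesis model to hear the effect, which this sketch does not attempt.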