Speaker embeddings are widely used in speaker verification systems and other applications where it is useful to characterise the voice of a speaker with a fixed-length vector. These embeddings tend to be treated as "black box" encodings, and how they relate to conventional acoustic and phonetic dimensions of voices has not been widely studied. In this paper we investigate how state-of-the-art speaker embedding systems represent the acoustic characteristics of speakers as described by conventional acoustic descriptors, age, and gender. Using a large corpus of 10,000 speakers and three embedding systems, we show that a small set of 9 acoustic parameters, chosen to be "interpretable", predicts embeddings about as well as 7 principal components, corresponding to over 50% of the variance in the data. We show that some principal dimensions operate differently for male and female speakers, suggesting that there is implicit gender recognition within the embedding systems. However, we show that speaker age is not well captured by embeddings, suggesting that there is room for improvement in how they are calculated.