Speaker embeddings are vector representations extracted from a speech signal that are intended to encode the speaker identity alone. They are commonly used to classify and discriminate between speakers. However, there is no objective measure of how well a speaker embedding disentangles the speaker identity from other speech characteristics. This means that the embeddings are far from ideal: they are highly dependent on the training corpus and still retain residual information about factors such as the linguistic content, recording conditions or speaking style of the utterance. This paper analyzes six sets of speaker embeddings extracted with some of the most recent and high-performing DNN architectures, focusing on the degree to which they truly disentangle the speaker identity from the speech signal. To evaluate the architectures fairly, a large multi-speaker parallel speech dataset is used, in which 46 speakers utter the same set of prompts, recorded in either a professional studio or their home environments. The analysis examines intra- and inter-speaker similarity measures computed over the different embedding sets, and tests whether simple classification and regression methods can recover several residual information factors from the speaker embeddings. The results show that the discriminative power of the analyzed embeddings is very high, yet across all analyzed architectures residual information remains in the representations, in the form of a high correlation with the recording conditions, linguistic content and utterance duration.
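As an illustration of the intra- and inter-speaker similarity analysis mentioned in the abstract, the following is a minimal sketch rather than the paper's actual procedure. It assumes cosine similarity as the measure (the abstract does not name one), and the function names `cosine` and `intra_inter_similarity`, as well as the toy two-dimensional embeddings, are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def intra_inter_similarity(embeddings, speaker_ids):
    """Return (mean intra-speaker, mean inter-speaker) cosine similarity.

    Assumption: every unordered pair of utterances is scored once, and the
    pair counts as intra-speaker when both utterances share a speaker label.
    """
    intra, inter = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        score = cosine(embeddings[i], embeddings[j])
        if speaker_ids[i] == speaker_ids[j]:
            intra.append(score)
        else:
            inter.append(score)
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Toy data (hypothetical): two speakers, two utterances each.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_speakers = ["spk_a", "spk_a", "spk_b", "spk_b"]

intra, inter = intra_inter_similarity(toy_embeddings, toy_speakers)
print(f"intra-speaker: {intra:.3f}, inter-speaker: {inter:.3f}")
```

A large gap between the two means indicates high discriminative power. It does not by itself show that the embeddings are free of residual information, which is why the abstract pairs this analysis with classification and regression probes.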