This study presents a comparative analysis of the speaker embeddings produced by speech foundation models and human subjective perception of speaker similarity. Human listeners can judge speaker similarity on a continuous scale, discerning how similar two voices are. In contrast, speech foundation models embed speaker characteristics into numerical representations. However, a question remains: does the numerical distance between speaker embeddings in these models truly align with the similarity perceived by humans? To address this, we conduct a comprehensive investigation using more than 40 models, comparing model-derived distances with human-perceived similarity scores. Furthermore, we identify which factors in model configuration contribute most to speaker embeddings that mirror human perception. Our findings provide insights for the development of more perceptually grounded speech foundation models.
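As an illustration of the kind of comparison described above, the following minimal sketch computes a distance for each pair of speaker embeddings and rank-correlates those distances with human similarity ratings. The abstract does not specify the distance measure or the agreement statistic, so the use of cosine distance and Spearman rank correlation here is an assumption, and the embeddings and ratings are hypothetical toy values rather than data from the study.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: each entry is a pair of 2-D "speaker embeddings".
# Real embeddings would come from a speech foundation model.
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [-1.0, 0.0]),
]
# Hypothetical listener ratings for the same pairs (higher = more similar).
human_similarity = [4.8, 3.5, 2.0, 1.0]

distances = [cosine_distance(u, v) for u, v in pairs]
rho = spearman(distances, human_similarity)
print(f"Spearman correlation between distance and human similarity: {rho:.3f}")
```

Because larger distances should correspond to lower perceived similarity, a strongly negative correlation indicates good agreement with listeners; in this toy example the ordering is perfectly reversed, so the coefficient is close to -1. A rank-based statistic is used in the sketch because it does not assume that model distances and human ratings are linearly related.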