Despite the maturity of modern speaker verification technology, its performance still degrades significantly on non-neutrally-phonated (e.g., shouted and whispered) speech. To address this issue, we propose a new speaker embedding compensation method based on a minimum mean square error (MMSE) estimator. The method models the joint distribution of the vocal effort transfer vector and non-neutrally-phonated embedding spaces, and it operates in a principal component analysis (PCA) domain to cope with the scarcity of non-neutrally-phonated speech data. Experiments are carried out using a cutting-edge speaker verification system that integrates a powerful self-supervised pre-trained model for speech representation. Compared with a state-of-the-art embedding compensation method, the proposed MMSE estimator yields superior equal error rate results on shouted speech and competitive results on whispered speech.
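As a rough illustration of the idea summarized in the abstract, the following is a minimal sketch of an MMSE estimator in a PCA domain. It is not the authors' implementation and rests on several assumptions that the abstract does not confirm: the joint distribution is taken to be a single Gaussian, the transfer vector is defined as the neutral embedding minus the non-neutral one, and paired neutral/non-neutral training embeddings are available. The names `fit_pca`, `fit_mmse_compensator`, `compensate`, and the `ridge` parameter are hypothetical.

```python
import numpy as np


def fit_pca(X, n_components):
    """Return the mean and the top principal directions of X (N x D)."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T  # shapes (D,), (D, k)


def fit_mmse_compensator(neutral, non_neutral, n_components, ridge=1e-6):
    """Fit a linear MMSE estimator of the transfer vector in a PCA domain.

    Assumptions (not taken from the paper): paired embeddings,
    transfer = neutral - non_neutral, and a single joint Gaussian.
    """
    transfer = neutral - non_neutral
    mu_y, P_y = fit_pca(non_neutral, n_components)
    mu_t, P_t = fit_pca(transfer, n_components)

    # Project both spaces onto their principal components (zero-mean by construction).
    Y = (non_neutral - mu_y) @ P_y
    T = (transfer - mu_t) @ P_t

    n = Y.shape[0]
    S_yy = Y.T @ Y / n + ridge * np.eye(Y.shape[1])  # regularized covariance
    S_ty = T.T @ Y / n                               # cross-covariance

    # Under a joint Gaussian, E[t | y] = mu_t + S_ty S_yy^{-1} (y - mu_y).
    W = np.linalg.solve(S_yy, S_ty.T).T
    return {"mu_y": mu_y, "P_y": P_y, "mu_t": mu_t, "P_t": P_t, "W": W}


def compensate(model, y):
    """Add the estimated transfer vector to non-neutral embedding(s) y."""
    y_pca = (y - model["mu_y"]) @ model["P_y"]
    t_pca = y_pca @ model["W"].T
    t_hat = model["mu_t"] + t_pca @ model["P_t"].T
    return y + t_hat
```

Working in the reduced PCA space keeps the covariance matrices small (k x k rather than D x D), which is one plausible reading of how a PCA domain can help when only a limited amount of shouted or whispered data is available for estimating them.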