Uncertainty modeling in speaker representation aims to capture the variability present in speech utterances. Conventional cosine scoring is computationally efficient and prevalent in speaker recognition, but it cannot account for uncertainty. To address this limitation, this paper proposes an approach that estimates uncertainty at the speaker embedding front-end and propagates it to the cosine scoring back-end. Experiments on the VoxCeleb and SITW datasets confirm that the proposed method is effective at handling the uncertainty arising from embedding estimation. Compared with conventional cosine similarity, it achieves average reductions of 8.5% in equal error rate (EER) and 9.8% in minimum detection cost function (minDCF). It is also computationally efficient in practice.
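To illustrate what propagating embedding uncertainty into a cosine-style score could look like, here is a minimal sketch. It is not the formulation proposed in the paper, which the abstract does not specify. It assumes each embedding is a Gaussian with a diagonal covariance and that the two embeddings are independent, so the expected inner product is the inner product of the means and the expected squared norm is the squared norm of the mean plus the sum of the variances. The function names and the toy values are hypothetical.

```python
import math
from typing import Sequence


def cosine_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Conventional cosine similarity between two point embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def uncertainty_aware_cosine_score(
    mean_a: Sequence[float],
    var_a: Sequence[float],
    mean_b: Sequence[float],
    var_b: Sequence[float],
) -> float:
    """Illustrative cosine-style score that discounts uncertain embeddings.

    Assumption (not taken from the paper): embeddings are independent
    Gaussians with diagonal covariance, so
        E[a . b]    = mean_a . mean_b
        E[||a||^2]  = ||mean_a||^2 + sum(var_a)
    The score is E[a . b] / sqrt(E[||a||^2] * E[||b||^2]).
    """
    dot = sum(x * y for x, y in zip(mean_a, mean_b))
    exp_sq_norm_a = sum(x * x for x in mean_a) + sum(var_a)
    exp_sq_norm_b = sum(y * y for y in mean_b) + sum(var_b)
    return dot / math.sqrt(exp_sq_norm_a * exp_sq_norm_b)


# Toy example: identical mean embeddings, one of them uncertain.
enroll_mean, enroll_var = [1.0, 0.0], [0.0, 0.0]
test_mean, test_var = [1.0, 0.0], [0.5, 0.5]

plain = cosine_score(enroll_mean, test_mean)
aware = uncertainty_aware_cosine_score(
    enroll_mean, enroll_var, test_mean, test_var
)
print(f"plain cosine: {plain:.4f}, uncertainty-aware: {aware:.4f}")
```

With zero variances the sketch reduces to ordinary cosine similarity, and larger variances shrink the score toward zero. The cost stays linear in the embedding dimension, which is in keeping with the efficiency the abstract claims for cosine scoring.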