Many factors can influence the performance of speaker recognition systems, such as emotion, language, and other speaker- or context-related variations. Since individual speech frames do not contribute equally to the utterance-level representation, it is essential to estimate the importance or reliability of each frame. The xi-vector model addresses this by assigning different weights to frames based on uncertainty estimation. However, its uncertainty estimation model is trained only implicitly through the classification loss and does not consider the temporal relationships between frames, which may lead to suboptimal supervision. In this paper, we propose an improved architecture, xi+. Compared with xi-vector, xi+ incorporates a temporal attention module to capture frame-level uncertainty in a context-aware manner. In addition, we introduce a novel loss function, Stochastic Variance Loss, which explicitly supervises the learning of uncertainty. Results demonstrate consistent performance improvements of about 10\% on the VoxCeleb1-O set and 11\% on the NIST SRE 2024 evaluation set.
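To make the frame-weighting idea concrete, the sketch below shows a minimal precision-weighted pooling step in the spirit of xi-vector: each frame embedding is weighted by a predicted precision (inverse uncertainty), so less reliable frames contribute less to the utterance-level vector. This is an illustrative simplification, not the exact xi-vector or xi+ formulation; the function name and shapes are assumptions for the example.

```python
import numpy as np

def uncertainty_weighted_pool(frames, log_precision):
    """Illustrative precision-weighted pooling over speech frames.

    frames:        (T, D) array of frame-level embeddings
    log_precision: (T, D) array of predicted log-precisions
                   (higher precision = lower uncertainty)
    Returns the (D,) pooled utterance-level embedding.
    """
    # Convert log-precision to positive weights.
    w = np.exp(log_precision)
    # Normalize weights per dimension so they sum to 1 over time.
    w = w / w.sum(axis=0, keepdims=True)
    # Uncertain frames (low precision) are down-weighted in the sum.
    return (w * frames).sum(axis=0)

# Usage: with uniform uncertainty, pooling reduces to a plain mean.
T, D = 100, 16
frames = np.random.randn(T, D)
pooled = uncertainty_weighted_pool(frames, np.zeros((T, D)))
```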