An utterance-level speaker embedding is typically obtained by aggregating a sequence of frame-level representations. In real-world scenarios, however, individual frames encode not only speaker-relevant information but also various nuisance factors, so different frames contribute unequally to the final utterance-level speaker representation in Automatic Speaker Verification (ASV) systems. To address this issue, we propose to estimate the inherent uncertainty of each frame and assign adaptive weights accordingly, with frames of higher uncertainty receiving lower attention. Based on this idea, we present U3-xi, a comprehensive framework designed to produce more reliable and interpretable uncertainty estimates for speaker embeddings. Specifically, we introduce several strategies for uncertainty supervision. First, we propose speaker-level uncertainty supervision via a Stochastic Variance Loss, in which the distance between an utterance embedding and its corresponding speaker centroid serves as a pseudo ground truth for uncertainty learning. Second, we incorporate global-level uncertainty supervision by injecting the predicted uncertainty into the softmax scale during training; this adaptive scaling mechanism adjusts the sharpness of the decision boundary according to sample difficulty, providing global guidance. Third, we redesign the uncertainty estimation module by integrating a Transformer encoder with multi-view self-attention, enabling the model to capture both rich local and long-range temporal dependencies. Comprehensive experiments demonstrate that U3-xi is model-agnostic and can be seamlessly applied to various speaker encoders. In particular, when applied to ECAPA-TDNN, it achieves 21.1% and 15.57% relative improvements on the VoxCeleb1 test sets in terms of EER and minDCF, respectively.
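To make the three mechanisms concrete, the following is a minimal numpy sketch of (i) uncertainty-weighted frame aggregation, (ii) a variance-style loss that regresses predicted uncertainty toward the embedding-to-centroid distance, and (iii) an uncertainty-adjusted softmax scale. All function names, the specific weighting scheme (softmax over negative log-variance), and the exact loss and scaling formulas are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def uncertainty_weighted_pooling(frames, log_var):
    """Aggregate frame-level features (T, D) into an utterance embedding (D,).
    Frames with higher predicted log-variance receive lower attention weight.
    (Hypothetical weighting: softmax over negative log-variance.)"""
    weights = softmax(-log_var)          # higher uncertainty -> smaller weight
    return weights @ frames

def stochastic_variance_loss(embedding, centroid, log_var_utt):
    """Sketch of speaker-level uncertainty supervision: the squared distance
    between the utterance embedding and its speaker centroid acts as a
    pseudo ground truth that the predicted variance should match."""
    target = np.linalg.norm(embedding - centroid) ** 2
    return (np.exp(log_var_utt) - target) ** 2

def adaptive_softmax_scale(base_scale, sigma):
    """Sketch of global-level supervision: shrink the softmax scale for
    high-uncertainty (difficult) samples, softening the decision boundary."""
    return base_scale / (1.0 + sigma)
```

For example, with two frames where the second has much higher predicted uncertainty, the pooled embedding is dominated by the first, low-uncertainty frame; a confident sample (sigma near 0) keeps the full softmax scale, while a difficult one is scaled down.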