Speaker verification systems have seen significant advances with the introduction of Multi-scale Feature Aggregation (MFA) architectures such as MFA-Conformer and ECAPA-TDNN. These models leverage information from multiple network depths by concatenating intermediate feature maps before the pooling and projection layers, demonstrating that even shallower feature maps encode valuable speaker-specific information. Building on this foundation, we propose a Multi-scale Feature Contrastive (MFCon) loss that directly enhances the quality of these intermediate representations. Our MFCon loss applies contrastive learning to all feature maps within the network, encouraging the model to learn more discriminative representations at the intermediate stages themselves. We show that enforcing better feature-map learning in this way yields speaker embeddings with increased discriminative power. Our method achieves a 9.05% improvement in equal error rate (EER) over the standard MFA-Conformer on the VoxCeleb1-O test set.
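The abstract does not give the exact formulation, so the following is only a minimal PyTorch sketch of one plausible instantiation: a supervised-contrastive (SupCon-style) objective applied to temporally mean-pooled feature maps from each encoder block. The function names (`supcon_loss`, `mfcon_loss`), the mean-pooling choice, and the temperature value are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def supcon_loss(features: torch.Tensor, labels: torch.Tensor,
                temperature: float = 0.1) -> torch.Tensor:
    """SupCon-style supervised contrastive loss over a batch of embeddings.

    features: (B, D) embeddings; labels: (B,) integer speaker IDs.
    """
    features = F.normalize(features, dim=1)
    sim = features @ features.T / temperature              # (B, B) similarity logits
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=features.device)
    sim = sim.masked_fill(self_mask, float("-inf"))        # exclude self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)    # log-softmax over candidates
    # Positives: same speaker label, excluding the anchor itself.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0)    # keep positive terms only
    loss = -pos_log_prob.sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    has_pos = pos_mask.any(dim=1)                          # skip anchors without positives
    return loss[has_pos].mean()


def mfcon_loss(feature_maps, labels, temperature: float = 0.1) -> torch.Tensor:
    """Average the contrastive loss over every intermediate feature map.

    feature_maps: list of (B, T, D) block outputs from the encoder.
    """
    per_block = [
        supcon_loss(fmap.mean(dim=1), labels, temperature)  # temporal mean pooling -> (B, D)
        for fmap in feature_maps
    ]
    return torch.stack(per_block).mean()
```

In training, a term like this would presumably be combined with the primary speaker-classification objective (e.g., an AAM-softmax loss) via a weighting factor; the weighting scheme here is likewise an assumption.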