This paper presents our system for the 2024 Text-Dependent Speaker Verification (TdSV) Challenge, which achieved a Minimum Detection Cost Function (MinDCF) of 0.0461 and an Equal Error Rate (EER) of 1.3\%. Given the limited challenge duration and the resources available at the time, we adapted existing state-of-the-art neural networks, ResNet-TDNN and NeXt-TDNN, that had been pretrained on the VoxCeleb dataset. In addition, we designed a lightweight, resource-efficient model, EfficientNet-A0, trained specifically on the challenge dataset to improve adaptation and strengthen the ensemble. The system combines these architectures with extensive data augmentation and optimised hyperparameters, which together yielded strong performance in text-dependent speaker verification. The results also demonstrate the effectiveness of multi-model ensemble learning for both speaker and phrase verification.
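For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how EER and a normalised MinDCF are commonly computed from trial scores. It is not the challenge's official scoring tool: the function names, the toy scores, and the default cost parameters (`p_target`, `c_miss`, `c_fa`) are illustrative assumptions and should be replaced with the values from the evaluation plan.

```python
import numpy as np


def error_rates(target_scores, nontarget_scores):
    """Miss and false-alarm rates for every threshold between sorted scores."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate(
        [np.ones(len(target_scores)), np.zeros(len(nontarget_scores))]
    )
    labels = labels[np.argsort(scores, kind="mergesort")]

    n_tar = len(target_scores)
    n_non = len(nontarget_scores)
    # Index k means "reject the k lowest-scoring trials, accept the rest".
    miss = np.concatenate([[0.0], np.cumsum(labels)]) / n_tar
    fa = np.concatenate([[n_non], n_non - np.cumsum(1.0 - labels)]) / n_non
    return miss, fa


def compute_eer(target_scores, nontarget_scores):
    """Equal error rate: the point where miss and false-alarm rates meet."""
    miss, fa = error_rates(target_scores, nontarget_scores)
    idx = np.argmin(np.abs(miss - fa))
    return float((miss[idx] + fa[idx]) / 2.0)


def compute_min_dcf(target_scores, nontarget_scores,
                    p_target=0.01, c_miss=10.0, c_fa=1.0):
    """Normalised minimum detection cost (cost parameters are assumptions)."""
    miss, fa = error_rates(target_scores, nontarget_scores)
    dcf = c_miss * p_target * miss + c_fa * (1.0 - p_target) * fa
    # Normalise by the cost of the best trivial system (accept all / reject all).
    default_cost = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return float(np.min(dcf) / default_cost)


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    tar = [0.9, 0.8, 0.7, 0.2]
    non = [0.1, 0.3, 0.4, 0.6]
    print("EER:", compute_eer(tar, non))
    print("MinDCF:", compute_min_dcf(tar, non))
```

Tied scores are handled only approximately here, which is usually acceptable for a quick sanity check but may differ slightly from an official scorer.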