This paper describes the DKU-MSXF system for Tracks 1, 2, and 3 of the VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC-23). For Track 1, we train ResNet-based networks, and constructing a cross-age QMF training set substantially improves system performance. For Track 2, we initialize from the Track 1 pre-trained model and perform mixed training with the additional VoxBlink-clean dataset. Models that incorporate the VoxBlink-clean data improve on the Track 1 models by more than 10% relative. Track 3 is a semi-supervised domain adaptation task, and for it we adopt a novel pseudo-labeling method based on triple thresholds and sub-center purification. Our final submissions achieve an mDCF of 0.1243 in Track 1, an mDCF of 0.1165 in Track 2, and an EER of 4.952% in Track 3.