Meta-learning has recently become a research hotspot in speaker verification (SV). In this paper, we introduce two methods to improve meta-learning training for SV. In the first method, a backbone embedding network is first trained jointly with the conventional cross-entropy loss and the prototypical networks (PN) loss. Then, inspired by speaker adaptive training in speech recognition, additional transformation coefficients are trained with the PN loss alone. These coefficients modify the original backbone embedding network during x-vector extraction. In the second method, random erasing data augmentation is applied to all support samples in each episode to construct positive pairs, and a contrastive loss between the augmented and original support samples is added to the training objective. Experiments are carried out on the SITW and VOiCES databases. Each method yields consistent improvements over existing meta-learning training frameworks, and combining the two yields further improvements on both databases.
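The loss terms named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the distance metric, the erasing parameters, or the form of the contrastive loss, so the squared Euclidean distance, the `max_fraction` and `fill_value` parameters, the InfoNCE-style contrastive loss, and the `temperature` value below are all assumptions chosen for illustration, as are the function names.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def prototypical_loss(support, support_labels, query, query_labels):
    """PN loss: classify each query embedding by its distance to class prototypes.

    Assumption: squared Euclidean distance, as in the original prototypical
    networks formulation; the abstract does not state the metric used.
    """
    classes = np.unique(support_labels)
    # One prototype per speaker: the mean of that speaker's support embeddings.
    prototypes = np.stack(
        [support[support_labels == c].mean(axis=0) for c in classes]
    )
    dists = ((query[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    log_probs = log_softmax(-dists)
    targets = np.searchsorted(classes, query_labels)
    return float(-log_probs[np.arange(len(query)), targets].mean())


def random_erase(features, rng, max_fraction=0.3, fill_value=0.0):
    """Erase one random rectangle in a (frames, bins) feature matrix.

    Assumption: max_fraction and fill_value are hypothetical settings,
    not values taken from the paper.
    """
    erased = features.copy()
    n_frames, n_bins = features.shape
    height = rng.integers(1, max(2, int(n_frames * max_fraction) + 1))
    width = rng.integers(1, max(2, int(n_bins * max_fraction) + 1))
    top = rng.integers(0, n_frames - height + 1)
    left = rng.integers(0, n_bins - width + 1)
    erased[top:top + height, left:left + width] = fill_value
    return erased


def contrastive_loss(original, augmented, temperature=0.1):
    """InfoNCE-style loss: row i of `original` should match row i of `augmented`.

    Assumption: this is one common contrastive formulation, used here only
    as a stand-in for the loss described in the abstract.
    """
    a = original / np.linalg.norm(original, axis=1, keepdims=True)
    b = augmented / np.linalg.norm(augmented, axis=1, keepdims=True)
    log_probs = log_softmax(a @ b.T / temperature)
    return float(-np.diag(log_probs).mean())
```

In an episode, the embeddings of the original support samples and of their randomly erased copies would be passed to `contrastive_loss`, and the result added to the PN loss (and, in the first training stage, the cross-entropy loss). How the terms are weighted is not stated in the abstract, so any weighting would be a further assumption.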