The scarcity of labeled far-field speech constrains the training of high-performing far-field speaker verification systems. Fine-tuning a model pre-trained on large-scale near-field speech substantially outperforms training from scratch. However, fine-tuning suffers from two limitations: catastrophic forgetting and overfitting. In this paper, we propose a weight transfer regularization (WTR) loss that constrains the distance between the weights of the model pre-trained on large-scale near-field speech and those of the model fine-tuned on a small amount of far-field speech. With the WTR loss, fine-tuning takes advantage of the discriminative ability previously acquired from large-scale near-field speech without catastrophic forgetting. We also use PAC-Bayes generalization theory to analyze the generalization bound of the model fine-tuned with the WTR loss. The analysis indicates that the WTR term gives the fine-tuned model a tighter generalization upper bound. Moreover, we explore three norm distances for weight transfer: the L1-norm, L2-norm, and Max-norm distances. Finally, we evaluate the effectiveness of the WTR loss on the VoxCeleb (pre-training) and FFSVC (fine-tuning) datasets.
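As a rough illustration of the idea, the sketch below adds a weighted distance between pre-trained and fine-tuned weights to a task loss, under each of the three norms named above. It is a minimal sketch under stated assumptions, not the paper's formulation: the names `weight_distance`, `total_loss`, and `alpha`, the flattened per-layer weight layout, the unsquared L2 norm, and the reading of Max-norm as the largest absolute element-wise difference are all assumptions, since the abstract does not specify them.

```python
from typing import Dict, List

# Hypothetical layout: layer name -> flattened list of weights.
Weights = Dict[str, List[float]]


def weight_distance(pretrained: Weights, finetuned: Weights, norm: str = "l2") -> float:
    """Distance between two weight sets under the chosen norm (assumed definitions)."""
    # Absolute element-wise differences, pooled over all layers.
    diffs = [
        abs(w_ft - w_pre)
        for name in pretrained
        for w_pre, w_ft in zip(pretrained[name], finetuned[name])
    ]
    if norm == "l1":
        return sum(diffs)
    if norm == "l2":
        return sum(d * d for d in diffs) ** 0.5
    if norm == "max":
        # Assumption: Max-norm taken as the largest absolute difference.
        return max(diffs, default=0.0)
    raise ValueError(f"unknown norm: {norm!r}")


def total_loss(task_loss: float, pretrained: Weights, finetuned: Weights,
               alpha: float = 0.01, norm: str = "l2") -> float:
    """Task loss plus the weighted regularization term (alpha is a hypothetical weight)."""
    return task_loss + alpha * weight_distance(pretrained, finetuned, norm)


if __name__ == "__main__":
    pre = {"fc": [1.0, 2.0], "out": [0.0]}
    ft = {"fc": [1.0, 5.0], "out": [4.0]}
    for norm in ("l1", "l2", "max"):
        print(norm, weight_distance(pre, ft, norm))
```

In an actual training setup the same term would be computed on the framework's parameter tensors so that gradients flow through the fine-tuned weights while the pre-trained weights stay fixed.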