Deep learning models are widely used for speaker recognition and spoofed speech detection. We propose GMM-ResNet2 for synthetic speech detection. Compared with the previous GMM-ResNet model, GMM-ResNet2 has four improvements. First, because GMMs of different orders differ in their ability to form smooth approximations to the feature distribution, multiple GMMs are used to extract multi-scale Log Gaussian Probability features. Second, a grouping technique is used to improve classification accuracy by exposing the group cardinality while reducing both the number of parameters and the training time; the final score is obtained by averaging the outputs of all group classifiers. Third, the residual block is improved by including one activation function and one batch normalization layer. Finally, an ensemble-aware loss function is proposed to integrate the independent loss functions of all ensemble members. On the ASVspoof 2019 LA task, GMM-ResNet2 achieves a minimum t-DCF of 0.0227 and an EER of 0.79%. On the ASVspoof 2021 LA task, GMM-ResNet2 achieves a minimum t-DCF of 0.2362 and an EER of 2.19%, relative reductions of 31.4% and 76.3%, respectively, compared with the LFCC-LCNN baseline.
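The multi-scale feature extraction and score averaging described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes diagonal-covariance GMMs, assumes the Log Gaussian Probability feature of a frame is the log of each weighted component density, and uses randomly initialised GMMs of orders 8, 16 and 32 in place of trained ones. The names `log_gaussian_prob`, `random_gmm` and `ensemble_score` are hypothetical.

```python
import numpy as np


def log_gaussian_prob(frames, weights, means, variances):
    """Per-component log(w_k * N(x | mu_k, diag(var_k))) for every frame.

    Assumed shapes: frames (T, D); weights (K,); means and variances (K, D).
    Returns an array of shape (T, K), one value per frame and component.
    """
    diff = frames[:, None, :] - means[None, :, :]                # (T, K, D)
    mahal = np.sum(diff ** 2 / variances[None, :, :], axis=2)    # (T, K)
    log_det = np.sum(np.log(variances), axis=1)                  # (K,)
    dim = frames.shape[1]
    log_norm = -0.5 * (dim * np.log(2.0 * np.pi) + log_det)      # (K,)
    return np.log(weights)[None, :] + log_norm[None, :] - 0.5 * mahal


def random_gmm(order, dim, rng):
    """Stand-in for a trained GMM (assumption: real models would be fitted)."""
    weights = rng.dirichlet(np.ones(order))
    means = rng.normal(size=(order, dim))
    variances = rng.uniform(0.5, 1.5, size=(order, dim))
    return weights, means, variances


def ensemble_score(group_scores):
    """Average the scores produced by the group classifiers."""
    return float(np.mean(group_scores))


rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 20))   # 50 frames of 20-dim acoustic features
orders = (8, 16, 32)                 # hypothetical GMM orders

# One feature map per GMM order; in the grouped design each map would feed
# its own classifier, whose scores are then averaged.
multi_scale = [log_gaussian_prob(frames, *random_gmm(k, 20, rng)) for k in orders]
```

Each entry of `multi_scale` has one column per Gaussian component, so a higher-order GMM yields a finer-grained view of the same frames. `ensemble_score` then reduces the per-group scores to a single decision score by plain averaging.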