This paper introduces the Multi-scale Feature Aggregation Conformer (MFA-Conformer) structure for audio anti-spoofing countermeasures (CM). MFA-Conformer combines a convolutional neural network with the Transformer, allowing it to aggregate both global and local information. This may help the anti-spoofing CM system capture synthetic artifacts hidden at both local and global scales. In addition, given the excellent performance of MFA-Conformer on automatic speech recognition (ASR) and automatic speaker verification (ASV) tasks, we present a transfer learning method that uses Conformer models pretrained on ASR or ASV tasks to enhance the robustness of CM systems. The proposed method is evaluated on both Chinese and English spoofing detection databases. On the FAD clean set, the MFA-Conformer model pretrained on the ASR task achieves an equal error rate (EER) of 0.038%, dramatically outperforming the baseline. Moreover, experimental results demonstrate that the proposed transfer learning method on Conformer is also effective on pure speech segments obtained after voice activity detection processing.
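For readers unfamiliar with the EER metric reported above, the following is a minimal sketch of how an equal error rate can be approximated from detection scores. It is not the evaluation code used in this work: the function name `compute_eer` is hypothetical, and the sketch assumes that a higher score means "more likely bona fide" and that the EER is read off at the candidate threshold where the false acceptance and false rejection rates are closest.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the equal error rate (EER) as a fraction in [0, 1].

    Assumption: a higher score means the utterance is more likely bona fide.
    """
    # Candidate thresholds are the observed score values themselves.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False acceptance rate: spoofed utterances scored at or above t.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection rate: bona fide utterances scored below t.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Toy scores for illustration only; these are not results from the paper.
    bonafide = [0.9, 0.8, 0.3, 0.6]
    spoof = [0.1, 0.2, 0.7, 0.4]
    print(f"EER = {100 * compute_eer(bonafide, spoof):.2f}%")
```

Under this convention the function returns a fraction, so an EER of 0.038% corresponds to a return value of 0.00038. Evaluation toolkits typically interpolate along the error curve instead of picking the nearest threshold, so values may differ slightly on small score sets.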