A good training set for speech spoofing countermeasures (CMs) requires diverse text-to-speech (TTS) and voice conversion (VC) spoofing attacks, but generating TTS and VC spoofed trials for a target speaker can be technically demanding. Instead of using full-fledged TTS and VC systems, this study uses neural-network-based vocoders to do copy-synthesis on bona fide utterances and treats the vocoded output as spoofed data. To make better use of the resulting pairs of bona fide and spoofed data, this study introduces a contrastive feature loss that can be plugged into the standard training criterion. Starting from the bona fide trials in the ASVspoof 2019 logical access training set, this study empirically compares several training sets created in the proposed manner using several neural non-autoregressive vocoders. Results on multiple test sets suggest good practices, such as fine-tuning the neural vocoders on bona fide data from the target domain. The results also demonstrate the effectiveness of the contrastive feature loss. With the best practices combined, the trained CM achieves competitive overall performance. Its equal error rates (EERs) on the ASVspoof 2021 hidden subsets are also lower than those of the top-1 challenge submission.
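To make the idea of a contrastive feature loss over bona fide and vocoded data concrete, the following is a minimal NumPy sketch. It is not the formulation used in this study, which the abstract does not specify; it assumes a generic supervised-contrastive-style loss, and the function name `contrastive_feature_loss`, the `temperature` parameter, and the label convention (1 = bona fide, 0 = vocoded) are all hypothetical choices made for illustration.

```python
import numpy as np


def contrastive_feature_loss(features, labels, temperature=0.1):
    """Supervised-contrastive-style loss (illustrative assumption, not the paper's loss).

    features: (N, D) array of utterance-level features, N >= 2.
    labels:   (N,) array, 1 = bona fide, 0 = vocoded (hypothetical convention).
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    n = len(labels)

    # L2-normalise so that dot products are cosine similarities.
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T / temperature

    # Exclude each sample's similarity with itself from the softmax.
    not_self = ~np.eye(n, dtype=bool)
    sim_masked = np.where(not_self, sim, -np.inf)

    # Numerically stable log-sum-exp over all other samples in the batch.
    row_max = sim_masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(
        np.exp(sim_masked - row_max).sum(axis=1, keepdims=True)
    )
    log_prob = sim - log_denom

    # Positives are the other samples that share the anchor's label.
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    valid = n_pos > 0  # anchors with at least one positive

    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / n_pos[valid]).mean())


# Toy batch: two bona fide features and two vocoded features.
toy_features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_separated = contrastive_feature_loss(toy_features, np.array([1, 1, 0, 0]))
loss_mixed = contrastive_feature_loss(toy_features, np.array([1, 0, 1, 0]))
```

In this toy batch the loss is smaller when features of the same class already cluster together (`loss_separated`) than when the labels cut across the clusters (`loss_mixed`). One plausible way to plug such a term into a standard criterion is to add it, scaled by a hypothetical weight, to the usual classification loss.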