A good training set for speech spoofing countermeasures (CMs) requires diverse text-to-speech (TTS) and voice conversion (VC) spoofing attacks, but generating TTS and VC spoofed trials for a target speaker can be technically demanding. Instead of using full-fledged TTS and VC systems, this study uses neural-network-based vocoders to perform copy-synthesis on bona fide utterances and treats the output as spoofed data. To make better use of the resulting pairs of bona fide and spoofed data, the study introduces a contrastive feature loss that can be plugged into the standard training criterion. Starting from the bona fide trials of the ASVspoof 2019 logical access training set, the study empirically compares several training sets created in the proposed manner with several non-autoregressive neural vocoders. Results on multiple test sets suggest good practices, such as fine-tuning the neural vocoders on bona fide data from the target domain, and demonstrate the effectiveness of the contrastive feature loss. Combining the best practices, the trained CM achieves competitive overall performance, and its equal error rates (EERs) on the ASVspoof 2021 hidden subsets are lower than those of the top-1 challenge submission.
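The abstract does not spell out the form of the contrastive feature loss, so the following is only a minimal sketch of one plausible formulation: a supervised-contrastive-style loss over utterance-level embeddings, where each bona fide utterance and its vocoded copy appear in the same batch. The function name `contrastive_feature_loss`, the `temperature` default, and the weighting shown in the final comment are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def contrastive_feature_loss(features, labels, temperature=0.07):
    """Supervised-contrastive-style loss (illustrative assumption).

    features: (N, D) array of embeddings, N >= 2.
    labels:   (N,) array, e.g. 1 = bona fide, 0 = vocoded (spoofed).
    Pulls same-label embeddings together and pushes the two classes apart.
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    n = len(labels)

    # Cosine similarity between every pair, scaled by the temperature.
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T / temperature

    # Exclude each sample's similarity with itself from the softmax.
    not_self = ~np.eye(n, dtype=bool)
    sim_masked = np.where(not_self, sim, -np.inf)

    # Numerically stable log-sum-exp over all other samples in the batch.
    row_max = sim_masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(
        np.exp(sim_masked - row_max).sum(axis=1, keepdims=True)
    )
    log_prob = sim - log_denom

    # Positives: other samples that share the anchor's label.
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    valid = n_pos > 0
    if not valid.any():
        return 0.0

    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[valid] / n_pos[valid]
    return float(per_anchor.mean())


# Hypothetical way to plug it into a standard criterion:
#   total_loss = cross_entropy_loss + weight * contrastive_feature_loss(emb, y)
```

Under this formulation the loss is small when bona fide and vocoded embeddings form separate clusters and large when they are interleaved, which is the behaviour one would want from a term meant to exploit bona fide/spoofed pairs.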