This paper addresses the problem of channel robustness in speech countermeasure (CM) systems, which distinguish synthetic speech from natural human speech. Based on two hypotheses, we propose perturbing phase information during the training of time-domain CM systems. First, communication networks often employ lossy compression codecs that encode only magnitude information and therefore heavily alter phase information. Second, state-of-the-art CM systems rely on phase information to identify spoofed speech. We therefore believe that the phase-domain information loss induced by lossy compression codecs degrades CM performance on unseen channels. We first establish the dependence of time-domain CM systems on phase information by perturbing phase at evaluation time, which causes strong degradation. We then demonstrate that perturbing phase during training leads to a significant performance improvement, whereas perturbing magnitude leads to further degradation.
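To make the idea of phase perturbation concrete, the following is a minimal sketch rather than the authors' method: it assumes a whole-utterance FFT and uniform random phase noise, neither of which is specified in the abstract. The function name `perturb_phase` and the `strength` parameter are hypothetical. The sketch keeps the magnitude spectrum fixed and alters only the phase, which is the property the abstract's training-time perturbation relies on.

```python
import numpy as np


def perturb_phase(wave, strength=1.0, rng=None):
    """Return a waveform with the same magnitude spectrum but perturbed phase.

    `strength` (hypothetical parameter) scales uniform noise in [-pi, pi]
    that is added to each frequency bin's phase; 0.0 leaves the signal
    unchanged up to floating-point error.
    """
    rng = np.random.default_rng() if rng is None else rng
    wave = np.asarray(wave, dtype=np.float64)
    n = wave.shape[-1]

    # Split the one-sided spectrum into magnitude and phase.
    spectrum = np.fft.rfft(wave)
    magnitude = np.abs(spectrum)
    phase = np.angle(spectrum)

    noise = strength * rng.uniform(-np.pi, np.pi, size=phase.shape)
    # The DC bin (and the Nyquist bin when n is even) must stay real for a
    # real-valued signal, so their phase is left untouched.
    noise[..., 0] = 0.0
    if n % 2 == 0:
        noise[..., -1] = 0.0

    # Recombine the original magnitude with the perturbed phase.
    perturbed = magnitude * np.exp(1j * (phase + noise))
    return np.fft.irfft(perturbed, n=n)
```

Under these assumptions, a training loop would apply `perturb_phase` to each waveform before it is passed to the time-domain CM model, so that the model sees phase variation of the kind a codec might introduce while the magnitude content is preserved.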