Current speech anti-spoofing countermeasures (CMs) perform very well on specific datasets. However, removing silence from test speech with Voice Activity Detection (VAD) can severely degrade their performance. This paper analyzes the impact of silence on speech anti-spoofing. First, the reasons for the impact are explored in terms of the proportion of silence duration and the content of silence. The proportion of silence duration in spoof speech generated by text-to-speech (TTS) algorithms is lower than that in bonafide speech, and the content of silence produced by different waveform generators differs from that of bonafide speech. Then the impact of silence on model prediction is explored. Even after retraining, error rates rise significantly for spoof speech generated by neural-network-based end-to-end TTS algorithms when the silence is removed. To explain why silence affects CMs, the attention distribution of a CM is visualized through class activation mapping (CAM). Furthermore, experiments that mask either silence or non-silence demonstrate the significance of the proportion of silence duration for detecting TTS and the importance of silence content for detecting voice conversion (VC). Based on the experimental results, masking silence is proposed as a way to improve the robustness of CMs against unknown spoofing attacks. Finally, attacks on anti-spoofing CMs that concatenate silence are introduced, along with low-pass filtering as a mitigation for both VAD and silence attacks.
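To make the two quantities in the abstract concrete, the following is a minimal sketch of how the proportion of silence duration might be measured and how silence might be removed from an utterance. It assumes a simple energy-threshold VAD, which is not necessarily the VAD used in the paper; the function names, the 25 ms / 10 ms framing at 16 kHz, and the -40 dB threshold relative to the loudest frame are all illustrative assumptions.

```python
import numpy as np

def frame_energy_db(signal, frame_len=400, hop=160):
    """Per-frame log energy in dB for a 1-D float signal."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        # Small constant avoids log(0) on digitally silent frames.
        energies[i] = 10.0 * np.log10(np.mean(frame ** 2) + 1e-12)
    return energies

def silence_mask(signal, threshold_db=-40.0, frame_len=400, hop=160):
    """True for frames whose energy is more than |threshold_db| below the loudest frame."""
    e = frame_energy_db(signal, frame_len, hop)
    return e < (e.max() + threshold_db)

def silence_proportion(signal, threshold_db=-40.0, frame_len=400, hop=160):
    """Fraction of frames labelled as silence (hypothetical helper)."""
    return float(np.mean(silence_mask(signal, threshold_db, frame_len, hop)))

def remove_silence(signal, threshold_db=-40.0, frame_len=400, hop=160):
    """Keep only samples covered by at least one non-silent frame (toy VAD trimming)."""
    mask = silence_mask(signal, threshold_db, frame_len, hop)
    keep = np.zeros(len(signal), dtype=bool)
    for i, silent in enumerate(mask):
        if not silent:
            keep[i * hop : i * hop + frame_len] = True
    return signal[keep]

# Toy utterance: 0.5 s of a 200 Hz tone followed by 0.5 s of digital silence.
sr = 16000
t = np.arange(sr // 2) / sr
utterance = np.concatenate([0.5 * np.sin(2 * np.pi * 200 * t), np.zeros(sr // 2)])

proportion_before = silence_proportion(utterance)
trimmed = remove_silence(utterance)
proportion_after = silence_proportion(trimmed)
```

On this toy signal roughly half of the frames are labelled silent before trimming and almost none afterwards. Real recordings contain background noise in their silent regions, so the threshold would need tuning, and a trained VAD would usually be more reliable than a fixed energy threshold.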