A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection

Thanks to advances in deep learning, speech generation systems now power a variety of real-world applications, such as text-to-speech for individuals with speech disorders, voice chatbots in call centers, and cross-lingual speech translation. While these systems can autonomously generate human-like speech and replicate specific voices, they also pose risks when misused for malicious purposes. This motivates the research community to develop models for detecting synthesized (i.e., fake) speech generated by deep-learning-based models, a task referred to as Deepfake Speech Detection. Because this task has emerged only in recent years, few survey papers address it. Moreover, existing surveys tend to summarize the techniques used to construct a Deepfake Speech Detection system rather than provide a thorough analysis. This gap motivated us to conduct a comprehensive survey that critically analyzes the challenges and developments in Deepfake Speech Detection. Our survey is structured to offer an in-depth analysis of current challenge competitions, public datasets, and the deep-learning techniques proposed to address existing challenges in the field. From this analysis, we propose hypotheses on leveraging and combining specific deep-learning techniques to improve the effectiveness of Deepfake Speech Detection systems. Beyond the survey itself, we perform extensive experiments to validate these hypotheses and propose a highly competitive model for the task. Given the analysis and the experimental results, we finally indicate promising research directions for Deepfake Speech Detection.
