Recent advances in deep learning and computer vision have made the synthesis and counterfeiting of multimedia content more accessible than ever, opening the door to threats from malicious users. In the audio field, we are witnessing the growth of speech deepfake generation techniques, which calls for the development of synthetic speech detection algorithms to counter malicious uses such as fraud or identity theft. In this paper, we consider three different feature sets proposed in the literature for the synthetic speech detection task and present a model that fuses them, achieving better overall performance than state-of-the-art solutions. The system was tested in different scenarios and on different datasets to demonstrate its robustness to anti-forensic attacks and its generalization capabilities.