Recent advances in Text-to-Speech (TTS) systems have substantially increased the realism of synthetic speech, raising new challenges for audio deepfake detection. This work presents a comparative evaluation of three state-of-the-art TTS models (Dia2, Maya1, and MeloTTS), representing streaming, LLM-based, and non-autoregressive architectures, respectively. A corpus of 12,000 synthetic audio samples was generated from the Daily-Dialog dataset and evaluated against four detection frameworks spanning semantic, structural, and signal-level approaches. The results reveal significant variability in detector performance across generative mechanisms: a detector effective against one TTS architecture may fail against others, particularly LLM-based synthesis. In contrast, a multi-view detection approach that combines complementary levels of analysis demonstrates robust performance across all evaluated models. These findings highlight the limitations of single-paradigm detectors and underscore the need for integrated detection strategies to address the evolving landscape of audio deepfake threats.