Multi-Speaker Conversational Audio Deepfake: Taxonomy, Dataset and Pilot Study

The rapid advances in text-to-speech (TTS) technologies have made audio deepfakes increasingly realistic and accessible, raising significant security and trust concerns. While existing research has largely focused on detecting single-speaker audio deepfakes, malicious real-world applications in multi-speaker conversational settings are emerging as a major, underexplored threat. To address this gap, we propose a conceptual taxonomy of multi-speaker conversational audio deepfakes, distinguishing between partial manipulations (one or multiple speakers altered) and full manipulations (entire conversations synthesized). As a first step, we introduce a new Multi-speaker Conversational Audio Deepfakes Dataset (MsCADD) of 2,830 audio clips containing real and fully synthetic two-speaker conversations, generated using VITS and SoundStorm-based NotebookLM models to simulate natural dialogue with variations in speaker gender and conversational spontaneity. MsCADD is limited to TTS-based deepfakes. We benchmark three neural baseline models (LFCC-LCNN, RawNet2, and Wav2Vec 2.0) on this dataset and report performance in terms of F1 score, accuracy, true positive rate (TPR), and true negative rate (TNR). The baselines provide a useful benchmark, but the results also reveal a significant gap in reliably detecting synthetic voices under varied conversational dynamics. Our dataset and benchmarks provide a foundation for future research on deepfake detection in conversational scenarios, a highly underexplored area that poses a major threat to trustworthy information in audio settings. The MsCADD dataset is publicly available to support reproducibility and benchmarking by the research community.
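For readers unfamiliar with the four reported metrics, the following is a minimal sketch of how they can be computed from binary clip-level predictions. It is illustrative only and is not the evaluation code used for MsCADD. The function name `binary_metrics` and the label convention (1 = synthetic as the positive class, 0 = real) are assumptions made for this example.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, TPR, TNR, and F1 for binary labels.

    Assumption for this sketch: 1 = synthetic (positive), 0 = real (negative).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # synthetic, flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # real, passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # real, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # synthetic, missed

    def safe_div(num, den):
        # Return 0.0 when a class is absent instead of raising an error.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    tpr = safe_div(tp, tp + fn)  # true positive rate (recall)
    tnr = safe_div(tn, tn + fp)  # true negative rate (specificity)
    return {
        "accuracy": safe_div(tp + tn, len(pairs)),
        "tpr": tpr,
        "tnr": tnr,
        "f1": safe_div(2 * precision * tpr, precision + tpr),
    }


# Toy example with six hypothetical clips (not MsCADD data).
metrics = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
```

Reporting TPR and TNR alongside accuracy shows whether a detector's errors fall mainly on synthetic or on real conversations, which a single aggregate score can hide.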
