Audio Deepfake Detection (ADD) is an emerging task that aims to detect fake audio produced by text-to-speech (TTS), voice conversion (VC), replay, and similar methods. Existing approaches typically take a mono signal as input and focus on robust feature extraction and effective classifier design. However, the dual-channel stereo information in an audio signal also carries important cues for deepfake detection, which prior work has not studied. In this paper, we propose a novel ADD model, termed M2S-ADD, that attempts to discover audio authenticity cues during the mono-to-stereo conversion process. We first project the mono signal to a stereo signal using a pretrained stereo synthesizer, then employ a dual-branch neural architecture to process the left and right channel signals, respectively. In this way, we effectively reveal the artifacts in fake audio and thereby improve ADD performance. Experiments on the ASVspoof2019 database show that M2S-ADD outperforms all baselines that take mono input. We release the source code at \url{https://github.com/AI-S2-Lab/M2S-ADD}.
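The data flow described above (mono input, mono-to-stereo projection, one branch per channel, fused decision) can be illustrated with a minimal Python sketch. This is not the released M2S-ADD implementation: `mono_to_stereo` is a hypothetical delay-and-gain stand-in for the pretrained stereo synthesizer, `branch_features` is a hypothetical log-energy stand-in for each neural branch, and the linear fusion in `score` with hand-set weights is an assumption made only to keep the example self-contained.

```python
import math


def mono_to_stereo(mono, delay=2, right_gain=0.8):
    """Hypothetical stand-in for the pretrained stereo synthesizer.

    The left channel copies the mono signal; the right channel is a
    delayed, attenuated copy. A real synthesizer would be a learned model.
    """
    assert 0 <= delay <= len(mono)
    left = list(mono)
    right = [0.0] * delay + [right_gain * x for x in mono[:len(mono) - delay]]
    return left, right


def branch_features(channel, frame_len=4):
    """Hypothetical stand-in for one branch: log energy per frame."""
    feats = []
    for start in range(0, len(channel) - frame_len + 1, frame_len):
        frame = channel[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        feats.append(math.log(energy + 1e-8))  # small constant avoids log(0)
    return feats


def score(mono, weights, bias=0.0):
    """Project to stereo, run both branches, fuse, and return a value in (0, 1)."""
    left, right = mono_to_stereo(mono)
    fused = branch_features(left) + branch_features(right)  # concatenation
    assert len(weights) == len(fused)
    logit = bias + sum(w * f for w, f in zip(weights, fused))
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid


if __name__ == "__main__":
    mono = [math.sin(0.3 * n) for n in range(16)]
    print(score(mono, weights=[0.1] * 8))
```

The point of the sketch is only the structure: the two channels are processed separately before fusion, so any left/right inconsistencies introduced by the conversion step remain visible to the classifier.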