Generative audio technologies now enable highly realistic voice cloning and real-time voice conversion, increasing the risk of impersonation, fraud, and misinformation in communication channels such as phone and video calls. This study investigates real-time detection of AI-generated speech produced using Retrieval-based Voice Conversion (RVC), evaluated on the DEEP-VOICE dataset, which includes authentic and voice-converted speech samples from multiple well-known speakers. To simulate realistic conditions, deepfake generation is applied to isolated vocal components, followed by the reintroduction of background ambiance to suppress trivial artifacts and emphasize conversion-specific cues. We frame detection as a streaming classification task by dividing audio into one-second segments, extracting time-frequency and cepstral features, and training supervised machine learning models to classify each segment as real or voice-converted. The proposed system enables low-latency inference, supporting both segment-level decisions and call-level aggregation. Experimental results show that short-window acoustic features can reliably capture discriminative patterns associated with RVC speech, even in noisy backgrounds. These findings demonstrate the feasibility of practical, real-time deepfake speech detection and underscore the importance of evaluating under realistic audio mixing conditions for robust deployment.
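The streaming framing described above — splitting audio into one-second segments and extracting short-window acoustic features for a per-segment real/fake decision — can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the sample rate, the number of frequency bands, and the specific features (coarse log band energies plus spectral centroid, standing in for the time-frequency and cepstral features the study uses) are all assumptions.

```python
import numpy as np

SR = 16_000          # assumed sample rate; the abstract does not specify one
SEG_LEN = SR         # one-second segments, matching the streaming setup

def segment(audio: np.ndarray, seg_len: int = SEG_LEN) -> np.ndarray:
    """Split a mono waveform into non-overlapping one-second segments,
    dropping any trailing partial segment."""
    n = len(audio) // seg_len
    return audio[: n * seg_len].reshape(n, seg_len)

def features(seg: np.ndarray) -> np.ndarray:
    """Toy per-segment features: log power in 8 coarse frequency bands
    plus the spectral centroid (illustrative stand-ins only)."""
    spec = np.abs(np.fft.rfft(seg)) ** 2
    freqs = np.fft.rfftfreq(len(seg), d=1 / SR)
    band_energy = np.log1p(np.array([b.sum() for b in np.array_split(spec, 8)]))
    centroid = (freqs * spec).sum() / (spec.sum() + 1e-12)
    return np.concatenate([band_energy, [centroid]])

# Example: 3.5 s of synthetic audio yields 3 full one-second segments,
# each mapped to a 9-dimensional feature vector.
audio = np.random.default_rng(0).standard_normal(int(3.5 * SR))
segs = segment(audio)
X = np.stack([features(s) for s in segs])
print(segs.shape, X.shape)  # → (3, 16000) (3, 9)
```

A supervised classifier trained on such per-segment vectors yields the segment-level decisions; call-level aggregation could then be as simple as a majority vote over a call's segment predictions.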