The rapid advancement of generative AI has made audio deepfakes increasingly indistinguishable from authentic human speech, posing significant threats to persons of interest (POI) such as public figures. Current detection systems rely primarily on generic, black-box models that fail to capture speaker-specific idiosyncratic traits and lack interpretability. In this paper, we propose Phoneme-based Voice Profiling (PVP), a novel personalized defense framework. By shifting the detection paradigm from macro-level utterance analysis to micro-level phonetic modeling, PVP captures the unique acoustic distributions underlying a POI's habitual articulatory patterns. Specifically, our framework models speaker-specific phonetic realizations using lightweight Gaussian Mixture Models (GMMs) estimated solely from bona fide reference speech. This design enables data-efficient profiling and robust generalization to previously unseen spoofing attacks without requiring heavy spoof-specific training. Furthermore, we introduce the first large-scale Chinese POI deepfake dataset for benchmarking speaker-specific detection. Experimental results demonstrate that PVP significantly outperforms state-of-the-art generic detectors in POI spoofing scenarios, substantially reducing the equal error rate (EER) while providing fine-grained, phoneme-level interpretability for forensic analysis. Code and data are available at: https://github.com/JunXue-tech/PVP
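The core idea of per-phoneme GMM profiling can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the abstract does not specify the acoustic features, the number of mixture components, the phoneme alignment method, or the scoring rule. The sketch assumes that frames have already been grouped by phoneme label, fits a small diagonal-covariance GMM per phoneme with plain EM, and scores a test utterance by its average log-likelihood under the profile. The names `fit_gmm`, `build_profile`, and `score_utterance` are hypothetical.

```python
import math
import random


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at frame x."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def logsumexp(vals):
    """Numerically stable log(sum(exp(vals)))."""
    top = max(vals)
    return top + math.log(sum(math.exp(v - top) for v in vals))


def fit_gmm(frames, k=2, iters=20, seed=0, floor=1e-3):
    """Fit a k-component diagonal GMM with EM (needs len(frames) >= k)."""
    rng = random.Random(seed)
    n, dim = len(frames), len(frames[0])
    means = [list(f) for f in rng.sample(frames, k)]
    gmean = [sum(f[d] for f in frames) / n for d in range(dim)]
    gvar = [max(sum((f[d] - gmean[d]) ** 2 for f in frames) / n, floor)
            for d in range(dim)]
    variances = [list(gvar) for _ in range(k)]
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each frame.
        resp = []
        for x in frames:
            logs = [math.log(weights[j]) + log_gauss_diag(x, means[j], variances[j])
                    for j in range(k)]
            norm = logsumexp(logs)
            resp.append([math.exp(v - norm) for v in logs])
        # M-step: re-estimate weights, means, and floored variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-10
            weights[j] = nj / n
            means[j] = [sum(r[j] * x[d] for r, x in zip(resp, frames)) / nj
                        for d in range(dim)]
            variances[j] = [
                max(sum(r[j] * (x[d] - means[j][d]) ** 2
                        for r, x in zip(resp, frames)) / nj, floor)
                for d in range(dim)
            ]
    return weights, means, variances


def avg_loglik(frames, gmm):
    """Average per-frame log-likelihood of frames under a fitted GMM."""
    weights, means, variances = gmm
    total = sum(
        logsumexp([math.log(w) + log_gauss_diag(x, m, v)
                   for w, m, v in zip(weights, means, variances)])
        for x in frames
    )
    return total / len(frames)


def build_profile(reference):
    """reference: dict phoneme -> list of bona fide feature frames."""
    return {phoneme: fit_gmm(frames) for phoneme, frames in reference.items()}


def score_utterance(profile, segments):
    """segments: list of (phoneme, frames). Returns (overall, per-phoneme) scores."""
    pooled = {}
    for phoneme, frames in segments:
        if phoneme in profile:  # phonemes without a profile are skipped
            pooled.setdefault(phoneme, []).extend(frames)
    if not pooled:
        raise ValueError("no segment matches a profiled phoneme")
    per_phoneme = {p: avg_loglik(f, profile[p]) for p, f in pooled.items()}
    overall = sum(per_phoneme.values()) / len(per_phoneme)
    return overall, per_phoneme
```

In this sketch a lower overall score suggests that the utterance departs from the reference speaker's phonetic realizations, and the per-phoneme scores indicate which phonemes contribute most to that departure, which loosely mirrors the phoneme-level interpretability described above. In practice an off-the-shelf estimator such as scikit-learn's `GaussianMixture` would likely replace the hand-written EM loop, and a decision threshold would need to be calibrated on held-out data.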