Synthetic-voice cloning technologies have advanced significantly in recent years, giving rise to a range of potential harms. With threats ranging from small- and large-scale financial fraud to disinformation campaigns, reliable methods for distinguishing real voices from synthesized ones are urgently needed. We describe three techniques for distinguishing a real voice from a cloned voice designed to impersonate a specific person. The three approaches differ in their feature extraction stage: they range from low-dimensional perceptual features, which offer high interpretability but lower accuracy, through generic spectral features, to end-to-end learned features, which offer less interpretability but higher accuracy. We show the efficacy of these approaches when trained on a single speaker's voice and when trained on multiple voices. The learned features consistently yield an equal error rate between $0\%$ and $4\%$, and are reasonably robust to adversarial laundering.
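The equal error rate quoted above is the operating point at which the false acceptance rate (cloned voices accepted as real) equals the false rejection rate (real voices rejected as cloned). A minimal sketch of how it could be computed, assuming each detector outputs a score that is higher for real voices; the function name `equal_error_rate` and the toy scores are illustrative assumptions, not taken from the paper:

```python
def equal_error_rate(real_scores, fake_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumption: a clip is labeled "real" when its score >= threshold.
    Returns (eer, threshold) at the point where the two error rates are closest.
    """
    thresholds = sorted(set(real_scores) | set(fake_scores))
    best = None
    for t in thresholds:
        # False rejection: a real voice scored below the threshold.
        frr = sum(s < t for s in real_scores) / len(real_scores)
        # False acceptance: a cloned voice scored at or above the threshold.
        far = sum(s >= t for s in fake_scores) / len(fake_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Hypothetical scores: one real clip (0.4) scores below one cloned clip (0.5).
eer, threshold = equal_error_rate([0.9, 0.8, 0.4, 0.7], [0.1, 0.2, 0.5, 0.3])
print(eer, threshold)  # 0.25 0.5
```

With finitely many scores the two error rates may never be exactly equal, so this sketch reports their average at the closest threshold; interpolating between thresholds is a common alternative.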