With the recent influx of voice generation methods, the threat posed by audio DeepFakes (DF) is ever-increasing. Several detection methods have been presented as countermeasures. Many of them rely on so-called front-ends, which transform the raw audio to emphasize features crucial for assessing the genuineness of an audio sample. We investigate the influence of the state-of-the-art Whisper automatic speech recognition model as a DF detection front-end. We compare various combinations of Whisper and well-established front-ends by training three detection models (LCNN, SpecRNet, and MesoNet) on the widely used ASVspoof 2021 DF dataset and then evaluating them on the DF In-The-Wild dataset. We show that Whisper-based features improve detection for each model and outperform recent results on the In-The-Wild dataset, reducing the Equal Error Rate by 21%.
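The Equal Error Rate used as the evaluation metric above is the operating point at which the false acceptance rate (spoofed audio accepted as genuine) equals the false rejection rate (genuine audio rejected). The abstract does not state how the metric is computed, so the following is only a minimal sketch under stated assumptions: the function name `equal_error_rate`, the label convention (1 = bona fide, 0 = spoof), and the convention that a higher score means "more likely bona fide" are all illustrative choices, not taken from the paper.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping a decision threshold over the scores.

    Assumed conventions (hypothetical, not from the paper):
      labels: 1 = bona fide, 0 = spoof
      scores: higher means more likely bona fide
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    bona_fide = scores[labels == 1]
    spoof = scores[labels == 0]

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "reject everything" operating point is also covered.
    thresholds = np.unique(scores)
    thresholds = np.concatenate([thresholds, [thresholds[-1] + 1.0]])

    # A sample is accepted as bona fide when its score is >= threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])      # spoof accepted
    frr = np.array([(bona_fide < t).mean() for t in thresholds])   # bona fide rejected

    # The EER lies where the two error curves cross; with finitely many
    # samples we take the threshold where they are closest and average them.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0


if __name__ == "__main__":
    # Toy scores: one spoof sample outranks one bona fide sample.
    print(equal_error_rate([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

On well-separated scores this sketch returns 0.0; published toolkits typically interpolate between thresholds, so their values can differ slightly on small evaluation sets.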