Automatic speaker verification (ASV) systems are vulnerable to fraud carried out with audio deepfakes, also known as logical-access voice spoofing attacks. Recent advances in generative AI and speech synthesis make these deepfakes a serious threat to voice biometrics. Although several deep learning models for synthetic speech detection have been developed, most generalize poorly, especially when the attacks follow statistical distributions different from those seen during training. This paper therefore presents Quick-SpoofNet, an approach for detecting both seen and unseen synthetic attacks on ASV systems using one-shot learning and metric learning. Using an effective spectral feature set, the proposed method extracts compact, representative temporal embeddings from voice samples and applies metric learning with a triplet loss to measure the similarity between embeddings and separate them. The system clusters similar speech embeddings, treating the bona fide speech cluster as the target class and identifying the other clusters as spoofing attacks. The proposed system is evaluated on the ASVspoof 2019 logical access (LA) dataset and tested against unseen deepfake attacks from the ASVspoof 2021 dataset. Its ability to generalize to unseen bona fide speech is additionally assessed using speech data from the VSDC dataset.
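To make the metric-learning idea concrete, the following is a minimal sketch of a triplet loss and a distance-based bona fide decision. It is not the authors' implementation: the abstract does not specify the distance measure, margin, or decision rule, so the Euclidean distance, the `margin` and `threshold` values, and the function names `triplet_loss` and `is_bona_fide` are all assumptions made for illustration.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss averaged over a batch of embeddings.

    Assumption: Euclidean distance and margin=0.2 are illustrative choices,
    not values taken from the paper.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=-1)  # anchor to same-class sample
    d_neg = np.linalg.norm(anchor - negative, axis=-1)  # anchor to other-class sample
    # Penalize triplets where the positive is not closer than the negative by `margin`.
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


def is_bona_fide(embedding, reference, threshold=0.5):
    """Accept a sample if its embedding lies within `threshold` of one bona fide reference.

    Assumption: a single reference embedding and a fixed threshold stand in
    for the one-shot comparison described in the abstract.
    """
    return bool(np.linalg.norm(embedding - reference) <= threshold)


# Toy 2-D embeddings (hypothetical values, not real speech features).
bona_fide_ref = np.array([0.0, 0.0])
bona_fide_sample = np.array([0.0, 0.1])
spoof_sample = np.array([1.0, 0.0])

well_separated = triplet_loss(bona_fide_ref[None], bona_fide_sample[None], spoof_sample[None])
badly_separated = triplet_loss(bona_fide_ref[None], spoof_sample[None], bona_fide_sample[None])
```

In this toy setup, a triplet whose positive already sits much closer to the anchor than its negative contributes zero loss, while a reversed triplet is penalized. Training on such a loss is what would pull bona fide embeddings into one cluster and push spoofed ones away from it.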