Spectral Masking and Interpolation Attack (SMIA): A Black-box Adversarial Attack against Voice Authentication and Anti-Spoofing Systems

Voice authentication systems (VAS) verify identity from unique vocal characteristics and are increasingly deployed in high-security sectors such as banking and healthcare. Although deep learning has improved these systems, they remain severely vulnerable to sophisticated threats such as deepfakes and adversarial attacks. Realistic voice cloning complicates detection further, because systems struggle to distinguish authentic from synthetic audio. Anti-spoofing countermeasures (CMs) exist to mitigate these risks, but many rely on static detection models that novel adversarial methods can bypass, leaving a critical security gap. To demonstrate this vulnerability, we propose the Spectral Masking and Interpolation Attack (SMIA), a novel black-box method that strategically manipulates inaudible frequency regions of AI-generated audio. By altering the signal only in regions imperceptible to the human ear, SMIA creates adversarial samples that sound authentic while deceiving CMs. We evaluated the attack against state-of-the-art (SOTA) models across multiple tasks under simulated real-world conditions. SMIA achieved an attack success rate (ASR) of at least 82% against combined VAS/CM systems, at least 97.5% against standalone speaker verification systems, and 100% against countermeasures. These findings demonstrate that current security postures are insufficient against adaptive adversarial attacks, and they highlight the urgent need for next-generation defenses built on dynamic, context-aware frameworks that can evolve with the threat landscape.
