Voice authentication on IoT-enabled smart devices has gained prominence in recent years owing to growing concerns over user privacy and security. Current authentication systems are vulnerable to voice-spoofing attacks (e.g., replay, voice cloning, and audio deepfakes) that mimic legitimate voices to deceive the system and enable fraudulent activities such as impersonation, unauthorized access, and financial fraud. Existing solutions are often designed to counter a single type of attack, so their performance degrades against unseen attacks. Existing unified voice anti-spoofing solutions, meanwhile, are not designed for IoT: their complex architectures make them impractical to deploy on IoT-enabled smart devices. Moreover, most of these unified solutions show significant performance issues, including higher equal error rates or lower accuracy for specific attacks. To overcome these issues, we present the parallel stacked aggregation network (PSA-Net), a lightweight anti-spoofing defense framework for voice-controlled smart IoT devices. PSA-Net processes raw audio directly, eliminating the need for dataset-dependent handcrafted features or pre-computed spectrograms. It employs a split-transform-aggregate approach: utterances are segmented, intrinsic differentiable embeddings are extracted through convolutions, and these embeddings are aggregated to distinguish legitimate from spoofed audio. In contrast to existing deep ResNet-oriented solutions, we incorporate cardinality as an additional dimension in our network, which enhances PSA-Net's ability to generalize across diverse attacks. The results show that PSA-Net achieves more consistent performance across different attacks than current anti-spoofing solutions.
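The split-transform-aggregate idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' PSA-Net architecture, which the abstract does not specify. It is a toy, standard-library-only illustration that assumes cardinality means the number of parallel convolution branches whose outputs are summed, as in aggregated residual networks. The function names (`split_utterance`, `aggregated_block`, `embed_utterance`), the fixed segment length, the hand-picked kernels, and the mean pooling are all hypothetical choices made for the example. A real model would learn its kernels and feed the embeddings to a classifier.

```python
def conv1d(signal, kernel):
    """1D cross-correlation with zero 'same' padding (odd-length kernel assumed)."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(padded[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal))]


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def split_utterance(waveform, segment_len):
    """Split step: cut the raw waveform into fixed-length segments,
    zero-padding the final segment if it is short."""
    segments = []
    for start in range(0, len(waveform), segment_len):
        seg = list(waveform[start:start + segment_len])
        seg += [0.0] * (segment_len - len(seg))
        segments.append(seg)
    return segments


def aggregated_block(segment, branch_kernels):
    """Transform and aggregate steps: apply each parallel branch
    (cardinality = len(branch_kernels)), sum the branch outputs,
    then add the input back as a residual connection."""
    total = [0.0] * len(segment)
    for kernel in branch_kernels:
        branch_out = relu(conv1d(segment, kernel))
        total = [t + b for t, b in zip(total, branch_out)]
    return [s + t for s, t in zip(segment, total)]


def embed_utterance(waveform, segment_len, branch_kernels):
    """Mean-pool each segment's block output into one value per segment
    (a stand-in for a learned embedding)."""
    embedding = []
    for seg in split_utterance(waveform, segment_len):
        out = aggregated_block(seg, branch_kernels)
        embedding.append(sum(out) / len(out))
    return embedding


# Toy usage: two branches (cardinality 2) with hand-picked, not learned, kernels.
kernels = [[0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
print(embed_utterance([1.0, 2.0, 3.0, 4.0, 5.0], 4, kernels))  # [5.0, 2.5]
```

Raising the cardinality here means adding more kernels to `branch_kernels`, which widens the block without making it deeper. That is the general trade-off the abstract alludes to when contrasting the approach with deep ResNet-oriented solutions.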