Automatic speaker recognition systems (SRSs) are widely used in voice applications for personal identification and access control. A typical SRS operates in three stages: training, enrollment, and recognition. Previous work has shown that SRSs can be bypassed by backdoor attacks at the training stage or by adversarial example attacks at the recognition stage. In this paper, we propose TUNER, a new type of backdoor attack against the enrollment stage of SRSs via adversarial ultrasound modulation, which is inaudible, synchronization-free, content-independent, and black-box. Our key idea is to inject the backdoor into the SRS with modulated ultrasound while a legitimate user is enrolling; afterward, the polluted SRS grants access to both the legitimate user and the adversary with high confidence. The major challenge is that the user's articulation at the enrollment stage is unpredictable. To overcome this challenge, we generate the ultrasonic backdoor by augmenting the optimization process with random speech content, vocalizing time, and volume of the user. Furthermore, to achieve real-world robustness, we improve the ultrasonic signal over traditional methods using sparse frequency points, pre-compensation, and single-sideband (SSB) modulation. We extensively evaluate TUNER on two common datasets and seven representative SRS models. Results show that our attack can successfully bypass speaker recognition systems while remaining robust to various speakers, speech content, etc.
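To make the signal-design terms concrete, the following is a minimal sketch of upper-sideband SSB modulation applied to a baseband signal described by a few sparse frequency points. It is an illustration under stated assumptions, not TUNER's actual pipeline: the 40 kHz carrier, the 192 kHz sample rate, the tone list, and the helper names `ssb_upper_sideband` and `tone_amplitude` are all hypothetical. For a sum of sinusoids, upper-sideband SSB reduces to shifting each tone at frequency f to carrier + f, with no carrier and no mirror tone at carrier - f.

```python
import math


def ssb_upper_sideband(tones, carrier_hz, sample_rate_hz, num_samples):
    """Synthesize an upper-sideband SSB signal from sparse baseband tones.

    tones: list of (frequency_hz, amplitude, phase_rad) tuples.
    Each tone at f is shifted to carrier_hz + f; neither the carrier nor
    the lower sideband at carrier_hz - f is emitted.
    """
    nyquist = sample_rate_hz / 2
    for freq, _, _ in tones:
        if carrier_hz + freq >= nyquist:
            raise ValueError("shifted tone exceeds the Nyquist frequency")
    signal = []
    for n in range(num_samples):
        t = n / sample_rate_hz
        signal.append(sum(
            amp * math.cos(2 * math.pi * (carrier_hz + freq) * t + phase)
            for freq, amp, phase in tones
        ))
    return signal


def tone_amplitude(signal, freq_hz, sample_rate_hz):
    """Estimate the amplitude at one frequency by projecting onto cos/sin."""
    re = sum(x * math.cos(2 * math.pi * freq_hz * n / sample_rate_hz)
             for n, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
             for n, x in enumerate(signal))
    return 2 * math.hypot(re, im) / len(signal)


# Assumed example values: two baseband tones, 40 kHz carrier, 10 ms of audio.
tones = [(500.0, 1.0, 0.0), (1000.0, 0.5, 0.3)]
signal = ssb_upper_sideband(tones, 40_000.0, 192_000.0, 1920)
upper = tone_amplitude(signal, 40_500.0, 192_000.0)  # shifted 500 Hz tone
lower = tone_amplitude(signal, 39_500.0, 192_000.0)  # suppressed mirror tone
```

With these values, the projection at 40.5 kHz recovers the amplitude of the 500 Hz tone, while the projection at 39.5 kHz is numerically zero, which is the property that distinguishes SSB from ordinary double-sideband amplitude modulation. An arbitrary (non-sparse) baseband signal would instead need a Hilbert transform or filtering to remove one sideband.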