Enrollment-stage Backdoor Attacks on Speaker Recognition Systems via Adversarial Ultrasound

Automatic Speaker Recognition Systems (SRSs) are widely used in voice applications for personal identification and access control. A typical SRS consists of three stages: training, enrollment, and recognition. Previous work has shown that SRSs can be bypassed by backdoor attacks at the training stage or by adversarial example attacks at the recognition stage. In this paper, we propose TUNER, a new type of backdoor attack against the enrollment stage of SRSs via adversarial ultrasound modulation, which is inaudible, synchronization-free, content-independent, and black-box. Our key idea is to inject the backdoor into the SRS with modulated ultrasound while a legitimate user is enrolling; afterward, the polluted SRS grants access to both the legitimate user and the adversary with high confidence. The major challenge for our attack is that user articulation at the enrollment stage is unpredictable. To overcome this challenge, we generate the ultrasonic backdoor by augmenting the optimization process with random speech content, vocalizing time, and volume of the user. Furthermore, to achieve real-world robustness, we improve the ultrasonic signal over traditional methods using sparse frequency points, pre-compensation, and single-sideband (SSB) modulation. We extensively evaluate TUNER on two common datasets and seven representative SRS models. Results show that our attack can successfully bypass speaker recognition systems while remaining robust to various speakers, speech content, etc.
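The augmentation idea in the abstract (random speech content, vocalizing time, and volume) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `augment_mix`, the gain range, and the toy signals are all hypothetical, and the optimization loop that would consume these samples is omitted. The sketch only shows how one random training sample might be drawn by overlaying a randomly chosen speech clip on a fixed-length trigger at a random start time and volume.

```python
import random


def augment_mix(trigger, speech_pool, rng, gain_range=(0.3, 1.0)):
    """Overlay one randomly chosen speech clip on the trigger signal.

    All names and ranges are illustrative assumptions, not taken from the paper.
    """
    speech = rng.choice(speech_pool)         # random speech content
    gain = rng.uniform(*gain_range)          # random user volume
    offset = rng.randrange(0, len(trigger))  # random vocalizing time
    mixed = list(trigger)
    for i, sample in enumerate(speech):
        j = offset + i
        if j >= len(mixed):
            break  # drop speech that runs past the end of the trigger
        mixed[j] += gain * sample
    return mixed, offset, gain


# Toy signals: a silent 16-sample "trigger" and two constant "speech" clips.
rng = random.Random(0)
trigger = [0.0] * 16
speech_pool = [[1.0] * 4, [0.5] * 6]
mixed, offset, gain = augment_mix(trigger, speech_pool, rng)
```

Drawing a fresh `(speech, offset, gain)` triple at every optimization step is one plausible way to make a signal independent of what the user says and when, which is the property the abstract calls content-independent and synchronization-free.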
