Speaker identification (SI) determines a speaker's identity from their spoken utterances. Previous work indicates that SI deep neural networks (DNNs) are vulnerable to backdoor attacks, which embed hidden triggers in a DNN's training data so that the DNN produces incorrect output when those triggers are present during inference. This is the first work to explore SI DNNs' vulnerability to backdoor attacks that use speakers' emotional prosody, resulting in dynamic, inconspicuous triggers. We conducted a parameter study using three different datasets and DNN architectures to determine the impact of emotions as backdoor triggers on the accuracy of SI systems. Additionally, we explored the robustness of our attacks by applying defenses such as pruning, STRIP-ViTA, and three popular preprocessing techniques: quantization, median filtering, and squeezing. Our findings show that these models are prone to our attack, indicating that emotional triggers (sad and neutral prosody) can be used effectively to compromise the integrity of SI systems. However, the results of our pruning experiments suggest potential solutions for reinforcing the models against our attacks, decreasing the attack success rate by up to 40%.
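The backdoor mechanism described above can be illustrated with a minimal sketch of label poisoning. It rests on assumptions that the abstract does not confirm: utterances already carry an emotion annotation, the attacker relabels a fraction of the trigger-emotion utterances to a chosen target speaker, and attack success is measured as the fraction of triggered test utterances classified as that target. The names `poison_dataset`, `attack_success_rate`, `poison_rate`, and the dictionary keys are hypothetical and chosen for illustration only.

```python
import random


def poison_dataset(samples, trigger_emotion, target_speaker, poison_rate, seed=0):
    """Relabel a fraction of trigger-emotion utterances to the target speaker.

    samples: list of dicts with keys "audio", "speaker", "emotion"
             (a hypothetical layout, not taken from the paper).
    Returns a new list; the input samples are left unchanged.
    """
    rng = random.Random(seed)
    # Candidates: utterances with the trigger emotion that do not
    # already belong to the target speaker.
    candidates = [
        i for i, s in enumerate(samples)
        if s["emotion"] == trigger_emotion and s["speaker"] != target_speaker
    ]
    n_poison = int(len(candidates) * poison_rate)
    chosen = set(rng.sample(candidates, n_poison))

    poisoned = []
    for i, s in enumerate(samples):
        copy = dict(s)  # shallow copy so the original labels survive
        if i in chosen:
            copy["speaker"] = target_speaker
        poisoned.append(copy)
    return poisoned


def attack_success_rate(predictions, target_speaker):
    """Fraction of triggered test utterances predicted as the target speaker."""
    if not predictions:
        return 0.0
    hits = sum(1 for p in predictions if p == target_speaker)
    return hits / len(predictions)
```

Under these assumptions, a model trained on the output of `poison_dataset` would learn to associate the trigger emotion with the target speaker, and `attack_success_rate` would be computed on its predictions for held-out utterances spoken with that emotion.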