Speech carries rich information about human emotions, and Speech Emotion Recognition (SER) has become an important topic in human-computer interaction. The robustness of SER models is crucial, particularly in privacy-sensitive and reliability-demanding domains such as private healthcare. Recently, the vulnerability of deep neural networks to adversarial attacks in the audio domain has attracted considerable research attention. However, prior work on audio adversarial attacks relies primarily on iterative gradient-based techniques, which are time-consuming and prone to overfitting the specific threat model. Furthermore, sparse perturbations, which have the potential for better stealthiness, remain underexplored in the audio domain. To address these challenges, we propose a generator-based attack method that produces sparse and transferable adversarial examples to deceive SER models in an efficient, end-to-end manner. We evaluate our method on two widely used SER datasets, the Database of Elicited Mood in Speech (DEMoS) and Interactive Emotional dyadic MOtion CAPture (IEMOCAP), and demonstrate that it generates successful sparse adversarial examples efficiently. Moreover, the generated adversarial examples exhibit model-agnostic transferability, enabling effective attacks on advanced victim models.