In light of the widespread application of Automatic Speech Recognition (ASR) systems, their security has received more attention than ever before, primarily because of the susceptibility of deep neural networks to adversarial examples. Previous studies have shown that surreptitiously crafted adversarial perturbations can manipulate speech recognition systems into producing malicious commands. Most of these attacks add noise perturbations under $\ell_p$ norm constraints, which inevitably leaves artifacts of manual modification. Recent research has alleviated this limitation by manipulating style vectors to synthesize adversarial examples from Text-to-Speech (TTS) synthesized audio. However, because these style modifications are driven by an optimization objective, they significantly reduce the controllability and editability of audio styles. In this paper, we propose an attack on ASR systems based on user-customized style transfer. We first test the effect of the Style Transfer Attack (STA), which applies style transfer and adversarial attack in sequential order. Then, as an improvement, we propose an iterative Style Code Attack (SCA) to maintain audio quality. Experimental results show that our method meets the need for user-customized styles and achieves an attack success rate of 82%, while preserving the naturalness of the audio, as confirmed by our user study.