Speech recognition systems driven by deep neural networks (DNNs) have revolutionized human-computer interaction through voice interfaces, which significantly facilitate our daily lives. However, the growing popularity of these systems also raises concerns about their security, particularly with regard to backdoor attacks. A backdoor attack inserts one or more hidden backdoors into a DNN model during training, such that the model's performance on benign inputs is unaffected, but the model is forced to produce an adversary-desired output whenever a specific trigger is present in its input. Despite the initial success of existing audio backdoor attacks, they suffer from the following limitations: (i) most require substantial adversary knowledge, which limits their applicability; (ii) they are not stealthy enough and are thus easily detected by humans; (iii) most cannot attack live speech, which reduces their practicality. To address these problems, we propose FlowMur, a stealthy and practical audio backdoor attack that can be launched with limited knowledge. FlowMur constructs an auxiliary dataset and a surrogate model to augment the adversary's knowledge. To make the trigger dynamic, it formulates trigger generation as an optimization problem and optimizes the trigger over different attachment positions. To enhance stealthiness, we propose an adaptive data poisoning method based on the Signal-to-Noise Ratio (SNR). Furthermore, ambient noise is incorporated into both trigger generation and data poisoning, which makes FlowMur robust to ambient noise and improves its practicality. Extensive experiments on two datasets demonstrate that FlowMur achieves high attack performance in both digital and physical settings while remaining resilient to state-of-the-art defenses. In particular, a human study confirms that triggers generated by FlowMur are not easily detected by participants.
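The SNR-based poisoning idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `scale_trigger_to_snr` and `poison`, the choice to measure speech power over the whole utterance, the uniformly random attachment position, and the Gaussian stand-in for ambient noise are all assumptions made for illustration. The sketch rescales a given trigger so that the speech-to-trigger power ratio matches a target SNR in dB, then adds it to the waveform at a random offset.

```python
import numpy as np


def scale_trigger_to_snr(speech, trigger, snr_db):
    """Rescale `trigger` so that 10*log10(P_speech / P_trigger) equals `snr_db`.

    Assumption: power is the mean squared amplitude over the full array.
    """
    speech_power = np.mean(speech ** 2)
    trigger_power = np.mean(trigger ** 2)
    # Target trigger power implied by the requested SNR.
    target_power = speech_power / (10.0 ** (snr_db / 10.0))
    return trigger * np.sqrt(target_power / trigger_power)


def poison(speech, trigger, snr_db, rng, noise_std=0.0):
    """Attach an SNR-scaled trigger at a random position (hypothetical helper).

    `noise_std` > 0 adds Gaussian noise as a simple stand-in for ambient noise.
    Returns the poisoned waveform and the chosen start index.
    """
    scaled = scale_trigger_to_snr(speech, trigger, snr_db)
    # Upper bound is exclusive, so the trigger always fits inside the utterance.
    start = int(rng.integers(0, len(speech) - len(scaled) + 1))
    poisoned = speech.copy()
    poisoned[start:start + len(scaled)] += scaled
    if noise_std > 0:
        poisoned += rng.normal(0.0, noise_std, size=len(speech))
    return poisoned, start


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.normal(0.0, 0.1, 16000)   # placeholder for a 1 s utterance at 16 kHz
    trigger = rng.normal(0.0, 1.0, 8000)   # placeholder for an optimized trigger
    poisoned, start = poison(speech, trigger, snr_db=20.0, rng=rng)
    print(poisoned.shape, start)
```

A higher `snr_db` yields a quieter trigger relative to the speech it is attached to, which is the stealthiness lever the abstract refers to. In the attack itself the trigger would come from the optimization step rather than from random noise.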