Despite remarkable improvements, automatic speech recognition remains susceptible to adversarial perturbations. Attacks on speech recognition are significantly more challenging than attacks on standard machine learning architectures, especially because the inputs to a speech recognition system are time series that carry both acoustic and linguistic properties of speech. Extracting all recognition-relevant information requires more complex pipelines and an ensemble of specialized components, so an attacker needs to consider the entire pipeline. In this paper, we present VENOMAVE, the first training-time poisoning attack against speech recognition. We pursue the same goal as the predominantly studied evasion attacks: leading the system to an incorrect, attacker-chosen transcription of a target audio waveform. In contrast to evasion attacks, however, we assume that the attacker can manipulate only a small part of the training data and does not alter the target audio waveform at runtime. We evaluate our attack on two datasets: TIDIGITS and Speech Commands. When poisoning less than 0.17% of the dataset, VENOMAVE achieves attack success rates of more than 80.0%, without access to the victim's network architecture or hyperparameters. In a more realistic scenario, in which the target audio waveform is played over the air in different rooms, VENOMAVE maintains a success rate of up to 73.3%. Finally, VENOMAVE achieves an attack transferability rate of 36.4% between two different model architectures.