Speech recognition is an essential entry point of human-computer interaction, and deep learning models have recently achieved remarkable success on this task. However, because model training and private data provision are often handled by separate parties, security threats that cause deep neural networks (DNNs) to misbehave deserve study. In recent years, typical backdoor attacks on speech recognition systems have been investigated. Existing backdoor methods rely on data poisoning: the attacker adds crafted perturbations to benign speech spectrograms or alters speech components such as pitch and timbre. As a result, the poisoned data can be detected by human listeners or by automatic deep-learning-based detectors. To improve the stealthiness of data poisoning, we propose a fast, non-neural algorithm called Random Spectrogram Rhythm Transformation (RSRT) in this paper. The algorithm combines four steps to generate stealthy poisoned utterances. Acting on the rhythm component, our proposed trigger stretches or squeezes mel spectrograms along the time axis and recovers them back to signals; this operation leaves timbre and content unchanged, which yields good stealthiness. We conduct experiments on two kinds of speech recognition tasks, speaker verification and automatic speech recognition, including tests of the stealthiness of the poisoned samples. The results show that our method is highly effective and stealthy: the rhythm trigger requires a low poisoning rate while achieving a very high attack success rate.
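The core operation behind the rhythm trigger, stretching or squeezing a mel spectrogram along its time axis while leaving the frequency axis (and thus per-frame timbre) untouched, can be sketched as follows. This is a minimal illustration assuming the transformation amounts to linear interpolation over spectrogram frames; the function name, the interpolation scheme, and the choice of `rate` are illustrative assumptions, not the paper's exact four-step RSRT algorithm, and the mel-analysis and signal-recovery steps are assumed to happen outside this function.

```python
import numpy as np

def stretch_spectrogram(S, rate):
    """Stretch (rate < 1) or squeeze (rate > 1) a spectrogram
    along its time axis by linear interpolation.

    S    : (n_mels, n_frames) magnitude mel spectrogram
    rate : time-scale factor; only frame positions change, so the
           frequency axis (timbre) and content are preserved.
    """
    n_frames = S.shape[1]
    n_out = max(1, int(round(n_frames / rate)))
    # Fractional source-frame positions for each output frame.
    src = np.linspace(0.0, n_frames - 1, n_out)
    # Interpolate each mel band independently along time.
    return np.stack(
        [np.interp(src, np.arange(n_frames), band) for band in S]
    )
```

In practice one would compute `S` from the benign utterance, apply this transformation with a randomly chosen `rate`, and invert the modified spectrogram back to a waveform (e.g. with a Griffin-Lim-style reconstruction) to obtain the poisoned sample.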