Audio adversarial examples are audio files that have been manipulated to fool an automatic speech recognition (ASR) system while still sounding benign to a human listener. Most methods for generating such examples follow a two-step algorithm: first, a viable adversarial audio file is produced; then, it is fine-tuned with respect to perceptibility and robustness. In this work, we present an integrated algorithm that uses psychoacoustic models and room impulse responses (RIRs) in the generation step. The RIRs are created dynamically by a neural network during the generation process to simulate a physical environment, hardening our examples against the transformations experienced in over-the-air attacks. We compare the different approaches in three experiments: in a simulated environment and in a realistic over-the-air scenario to evaluate robustness, and in a human study to evaluate perceptibility. Our algorithms that consider psychoacoustics, either alone or in combination with robustness, show an improvement in the signal-to-noise ratio (SNR) as well as in the human perception study, at the cost of an increased word error rate (WER).
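The quantities named above can be illustrated with a minimal, dependency-free sketch. It is not the algorithm presented in this work: the function names `convolve`, `snr_db`, and `wer` are hypothetical, the RIR is a hand-written toy echo rather than one produced by a neural network, and the perturbation is an arbitrary stand-in for an adversarial one. The sketch only shows how convolving a signal with an RIR simulates a room, and how SNR and WER are commonly computed.

```python
import math


def convolve(signal, rir):
    """Full linear convolution of a signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


def snr_db(clean, adversarial):
    """SNR in dB, treating (adversarial - clean) as the noise."""
    signal_power = sum(x * x for x in clean)
    noise_power = sum((a - x) ** 2 for x, a in zip(clean, adversarial))
    if noise_power == 0:
        return math.inf
    return 10.0 * math.log10(signal_power / noise_power)


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must not be empty")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Toy data (assumptions for illustration only).
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]
adversarial = [x + 0.01 * (-1) ** n for n, x in enumerate(clean)]
toy_rir = [1.0, 0.0, 0.5, 0.0, 0.25]  # direct path plus two decaying echoes

received = convolve(adversarial, toy_rir)  # simulated over-the-air signal
print(f"SNR of perturbation: {snr_db(clean, adversarial):.1f} dB")
print(f"WER: {wer('open the front door', 'open the back door'):.2f}")
```

A higher SNR here means a quieter perturbation relative to the original audio. Which transcription serves as the WER reference (for example, the attacker's target text) depends on the evaluation setup and is left open in this sketch.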