Compensating Removed Frequency Components: Thwarting Voice Spectrum Reduction Attacks

Automatic speech recognition (ASR) provides diverse audio-to-text services that let humans communicate with machines. However, recent research reveals that ASR systems are vulnerable to various malicious audio attacks. In particular, by removing non-essential frequency components, a new spectrum reduction attack can generate adversarial audio that humans can still perceive but ASR systems cannot correctly interpret. This raises a new challenge for content moderation solutions that must detect harmful content in audio and video on social media platforms. In this paper, we propose an acoustic compensation system named ACE to counter spectrum reduction attacks against ASR systems. Our system design is based on two observations, namely, frequency component dependencies and perturbation sensitivity. First, since the Discrete Fourier Transform computation inevitably introduces spectral leakage and aliasing effects into the audio frequency spectrum, components with similar frequencies are highly correlated. Thus, by exploiting the intrinsic dependencies between neighboring frequency components, it is possible to recover more of the original audio by compensating for the removed components based on the remaining ones. Second, since the components removed by a spectrum reduction attack can be regarded as an inverse of adversarial noise, the attack success rate decreases when the adversarial audio is replayed in an over-the-air scenario. Hence, we can model the acoustic propagation process to add over-the-air perturbations to the attacked audio. We implement a prototype of ACE, and our experiments show that ACE eliminates up to 87.9% of the ASR inference errors caused by spectrum reduction attacks. In addition, by analyzing the residual errors, we summarize six general types of ASR inference errors and investigate their causes and potential mitigation solutions.
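
The first observation can be illustrated with a toy example. The sketch below is not the ACE algorithm. It is a minimal illustration, under our own assumptions, of why spectral leakage makes a removed bin partly predictable from a retained neighbor. The function names (`spectrum_reduction`, `compensate`), the magnitude threshold `ratio`, and the attenuation factor `decay` are all hypothetical choices made for this example.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT; returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def spectrum_reduction(spectrum, ratio):
    """Toy attack: zero every bin whose magnitude is below ratio * peak."""
    threshold = ratio * max(abs(c) for c in spectrum)
    removed = {k for k, c in enumerate(spectrum) if abs(c) < threshold}
    attacked = [0j if k in removed else c for k, c in enumerate(spectrum)]
    return attacked, removed


def compensate(attacked, removed, decay=0.5):
    """Toy defense: refill a removed bin from its retained immediate neighbors.

    `decay` is an assumed attenuation factor, not a value from the paper.
    Bins with no retained neighbor are left at zero.
    """
    n = len(attacked)
    restored = list(attacked)
    for k in removed:
        neighbors = [attacked[j % n] for j in (k - 1, k + 1)
                     if j % n not in removed]
        if neighbors:
            restored[k] = decay * sum(neighbors) / len(neighbors)
    return restored


def squared_error(a, b):
    """Sum of squared sample differences between two signals."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


n = 64
# An off-bin tone (5.5 cycles per frame) leaks energy into neighboring bins.
clean = [math.sin(2 * math.pi * 5.5 * t / n) for t in range(n)]
spectrum = dft(clean)
attacked, removed = spectrum_reduction(spectrum, ratio=0.5)
restored = compensate(attacked, removed)

err_attacked = squared_error(clean, idft(attacked))
err_restored = squared_error(clean, idft(restored))
print(f"removed {len(removed)} of {n} bins")
print(f"squared error: attacked={err_attacked:.3f}, compensated={err_restored:.3f}")
```

For this particular tone, the leaked bins on each side of the peak share the sign of the nearest strong bin, so even a crude scaled copy of the retained neighbor lowers the reconstruction error. Real speech would need a far more careful estimator, and the sketch does not cover the over-the-air perturbation step.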
