Inaudible Adversarial Perturbation: Manipulating the Recognition of User Speech in Real Time

Automatic speech recognition (ASR) systems have been shown to be vulnerable to adversarial examples (AEs). Recent successful attacks all assume that users will neither notice nor disrupt the attack process, despite the music- or noise-like sounds involved and the spontaneous responses they trigger from voice assistants. In practical user-present scenarios, however, user awareness may nullify existing attacks that emit unexpected sounds or cause unexpected ASR activity. In this paper, we seek to bridge this gap and extend the attack to user-present scenarios. We propose VRIFLE, an inaudible adversarial perturbation (IAP) attack delivered via ultrasound that can manipulate ASRs while a user is speaking. The inherent differences between audible sound and ultrasound pose unprecedented challenges for IAP delivery, such as distortion, noise, and instability. To address them, we design a novel ultrasonic transformation model that makes the crafted perturbation physically effective and able to survive even long-distance delivery. We further improve VRIFLE's robustness by applying a series of augmentations that account for user and real-world variations during generation. As a result, VRIFLE can effectively manipulate the ASR output in real time from different distances and regardless of what the user says, using an alter-and-mute strategy that suppresses the impact of user disruption. Our extensive experiments in both the digital and physical worlds verify VRIFLE's effectiveness under various configurations, its robustness against six kinds of defenses, and its universality in a targeted manner. We also show that VRIFLE can be delivered with a portable attack device and even with everyday loudspeakers.

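The abstract does not spell out how an ultrasonic signal ends up as audible-band content inside a recording. A minimal sketch of the mechanism commonly described for inaudible voice attacks is given below, under two assumptions that are not taken from the text above: the perturbation is amplitude-modulated onto an ultrasonic carrier, and the microphone has a small quadratic nonlinearity that demodulates it. All function names and constants are hypothetical, and this is an illustration only, not VRIFLE's ultrasonic transformation model.

```python
import math

def tone_amplitude(signal, freq_hz, sample_rate):
    """Amplitude of the freq_hz component, measured with a single-bin DFT."""
    n = len(signal)
    re = sum(s * math.cos(2 * math.pi * freq_hz * i / sample_rate)
             for i, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq_hz * i / sample_rate)
             for i, s in enumerate(signal))
    return 2 * math.hypot(re, im) / n

def modulate(baseband, carrier_hz, sample_rate, depth=0.5):
    """Amplitude modulation: (1 + depth * m[i]) * cos(2*pi*fc*t)."""
    return [(1 + depth * m) * math.cos(2 * math.pi * carrier_hz * i / sample_rate)
            for i, m in enumerate(baseband)]

def nonlinear_mic(signal, a1=1.0, a2=0.1):
    """Hypothetical microphone model with a quadratic term: a1*x + a2*x^2."""
    return [a1 * x + a2 * x * x for x in signal]

# Illustrative constants (assumptions, not values from the paper).
SAMPLE_RATE = 192_000   # Hz, high enough to represent the carrier and its harmonics
CARRIER_HZ = 30_000     # Hz, above the range of human hearing
TONE_HZ = 1_000         # Hz, stand-in for an audible-band perturbation
NUM_SAMPLES = 1_920     # 10 ms, a whole number of periods of every tone involved

baseband = [math.cos(2 * math.pi * TONE_HZ * i / SAMPLE_RATE)
            for i in range(NUM_SAMPLES)]
transmitted = modulate(baseband, CARRIER_HZ, SAMPLE_RATE)
recorded = nonlinear_mic(transmitted)

# Energy at the audible tone before and after the nonlinear microphone.
tx_audible = tone_amplitude(transmitted, TONE_HZ, SAMPLE_RATE)
rx_audible = tone_amplitude(recorded, TONE_HZ, SAMPLE_RATE)
print(f"1 kHz amplitude in the air:       {tx_audible:.6f}")
print(f"1 kHz amplitude in the recording: {rx_audible:.6f}")
```

In this toy model the transmitted signal has essentially no energy at 1 kHz, since all of it sits around 30 kHz, while the recorded signal contains a 1 kHz component of amplitude a2 * depth = 0.05 produced by the squared term. A real delivery path would add the distortion, noise, and instability that the abstract lists as challenges, none of which this sketch models.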