Inaudible Adversarial Perturbation: Manipulating the Recognition of User Speech in Real Time

Automatic speech recognition (ASR) systems have been shown to be vulnerable to adversarial examples (AEs). Recent successful attacks all assume that users will not notice or disrupt the attack process, despite the presence of music-like or noise-like sounds and spontaneous responses from voice assistants. In practical user-present scenarios, however, user awareness may nullify existing attacks that produce unexpected sounds or trigger unexpected ASR usage. In this paper, we seek to bridge this gap in existing research and extend the attack to user-present scenarios. We propose VRIFLE, an inaudible adversarial perturbation (IAP) attack via ultrasound delivery that can manipulate ASRs as a user speaks. The inherent differences between audible sounds and ultrasound pose unprecedented challenges for IAP delivery, such as distortion, noise, and instability. To address them, we design a novel ultrasonic transformation model that makes the crafted perturbation physically effective and able to survive long-distance delivery. We further make VRIFLE robust by applying a series of augmentations that cover user and real-world variations during the generation process. As a result, VRIFLE can manipulate the ASR output in real time from different distances and regardless of what the user says, using an alter-and-mute strategy that suppresses the impact of user disruption. Our extensive experiments in both the digital and physical worlds verify VRIFLE's effectiveness under various configurations, its robustness against six kinds of defenses, and its universality in a targeted manner. We also show that VRIFLE can be delivered with a portable attack device and even with everyday loudspeakers.
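To make the idea of ultrasound delivery concrete, the following is a minimal sketch of one commonly described mechanism for inaudible audio injection: an audible-band signal is amplitude-modulated onto an ultrasonic carrier, and a slight quadratic nonlinearity in the microphone demodulates it back into the audible band. The abstract does not state which mechanism VRIFLE relies on, so this is an illustrative assumption and not the paper's ultrasonic transformation model. The sample rate, carrier frequency, modulation depth, nonlinearity coefficient, and all function names are hypothetical.

```python
import math

FS = 192_000         # sample rate in Hz (assumed)
F_CARRIER = 30_000   # ultrasonic carrier in Hz (assumed)
F_TONE = 1_000       # 1 kHz tone standing in for an audible-band perturbation
N = 1_920            # 10 ms of samples, so every frequency used is an exact DFT bin


def tone_amplitude(signal, freq, fs=FS):
    """Estimate the amplitude of the `freq` component with a single-bin DFT."""
    re = sum(s * math.cos(2 * math.pi * freq * n / fs) for n, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq * n / fs) for n, s in enumerate(signal))
    return 2 * math.hypot(re, im) / len(signal)


def modulate(baseband, depth=0.5):
    """Amplitude-modulate the baseband signal onto the ultrasonic carrier."""
    return [
        (1 + depth * b) * math.cos(2 * math.pi * F_CARRIER * n / FS)
        for n, b in enumerate(baseband)
    ]


def microphone(signal, a2=0.1):
    """Toy microphone model (assumption): linear term plus a small quadratic term."""
    return [s + a2 * s * s for s in signal]


baseband = [math.cos(2 * math.pi * F_TONE * n / FS) for n in range(N)]
tx = modulate(baseband)   # what the ultrasonic speaker emits
rx = microphone(tx)       # what the microphone circuit produces

# The emitted signal has energy only around the carrier (29, 30, 31 kHz),
# so its 1 kHz component is numerically zero.
print(f"1 kHz amplitude in emitted signal:  {tone_amplitude(tx, F_TONE):.6f}")
# The quadratic term recreates the 1 kHz tone with amplitude a2 * depth.
print(f"1 kHz amplitude after microphone:   {tone_amplitude(rx, F_TONE):.6f}")
```

In this toy model the emitted signal contains nothing in the audible band, yet the recorded signal carries the 1 kHz tone at amplitude `a2 * depth`. A real channel would add the distortion, noise, and instability that the abstract names as the main challenges.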
