In this paper, we propose PhantomSound, a query-efficient black-box attack against voice assistants. Existing black-box adversarial attacks on voice assistants either rely on substitute models or use intermediate model outputs to estimate the gradients needed to craft adversarial audio samples. However, these approaches require a large number of queries and a lengthy training stage. PhantomSound instead uses a decision-based attack to produce effective adversarial audio and reduces the number of queries by optimizing the gradient estimation. In our experiments, we attack 4 different speech-to-text APIs under 3 real-world scenarios to demonstrate the real-time impact of the attack. The results show that PhantomSound is practical and robust when attacking 5 popular commercial voice-controllable devices over the air, and that it bypasses 3 liveness detection mechanisms with a success rate above 95%. The benchmark results show that PhantomSound can generate adversarial examples and launch the attack within minutes. Compared with state-of-the-art black-box attacks, PhantomSound reduces the cost of a successful untargeted and targeted adversarial attack by 93.1% and 65.5%, respectively, using only ~300 queries (~5 minutes) and ~1,500 queries (~25 minutes).
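To make the idea of decision-based gradient estimation concrete, the sketch below shows a generic Monte Carlo estimator of the kind used in hard-label attacks such as HopSkipJump. It is not PhantomSound's actual algorithm, which the abstract does not specify. The toy linear oracle stands in for a speech-to-text API that reveals only a yes/no decision, and every name here (`CountingOracle`, `estimate_direction`, `cosine`) is hypothetical and chosen for illustration only.

```python
import random


class CountingOracle:
    """Toy hard-label oracle (an assumption, not a real API).

    It returns only a boolean decision and counts how many times it is
    queried, which is the cost that a query-efficient attack tries to reduce.
    """

    def __init__(self, weights, bias=0.0):
        self.weights = weights
        self.bias = bias
        self.queries = 0

    def is_adversarial(self, x):
        self.queries += 1
        score = sum(w * xi for w, xi in zip(self.weights, x)) + self.bias
        return score > 0.0


def estimate_direction(oracle, x, num_samples, delta, rng):
    """Estimate the decision-boundary normal at x from decisions alone.

    Each random unit direction u is probed at x + delta * u; the direction is
    added if the probe is adversarial and subtracted otherwise. The result is
    normalized to unit length.
    """
    dim = len(x)
    total = [0.0] * dim
    for _ in range(num_samples):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = sum(v * v for v in u) ** 0.5
        u = [v / norm for v in u]
        probe = [xi + delta * ui for xi, ui in zip(x, u)]
        sign = 1.0 if oracle.is_adversarial(probe) else -1.0
        total = [t + sign * ui for t, ui in zip(total, u)]
    norm = sum(t * t for t in total) ** 0.5
    return [t / norm for t in total]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    na = sum(ai * ai for ai in a) ** 0.5
    nb = sum(bi * bi for bi in b) ** 0.5
    return dot / (na * nb)


if __name__ == "__main__":
    rng = random.Random(0)
    true_normal = [3.0, -4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    oracle = CountingOracle(true_normal)
    # This point lies exactly on the toy boundary: 3*4 - 4*3 = 0.
    boundary_point = [4.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    direction = estimate_direction(oracle, boundary_point, 200, 0.01, rng)
    print("queries used:", oracle.queries)
    print("alignment with true normal:", cosine(direction, true_normal))
```

In this sketch each probe costs exactly one query, so the sample count per estimate directly sets the query budget. Fewer samples make each estimate cheaper but noisier, which is the trade-off that an optimized gradient estimation has to manage.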