In transfer-based attacks against automatic speech recognition (ASR) systems, the attacker has no access to the architecture or parameters of the target model. Existing attack methods have mostly been studied in voice-assistant scenarios with a restricted set of voice commands, which limits their applicability to more general ASR-related applications. To address this limitation, we propose TransAudio, a novel contextualized attack that supports deletion, insertion, and substitution as adversarial behaviors and achieves arbitrary word-level attacks through a two-stage framework. To strengthen transferability, we further introduce an audio score-matching optimization strategy that regularizes the training process and mitigates over-fitting of adversarial examples to the surrogate model. Extensive experiments and analysis demonstrate the effectiveness of TransAudio against open-source ASR models and commercial APIs.