Automatic speech recognition (ASR) is widely studied and deployed, so the robustness of ASR models against minor input perturbations is crucial to their effectiveness in real-time scenarios. Previous work on ASR robustness has mostly evaluated accuracy in white-box settings, where the attacker has full access to the model. In real-world applications, however, full model details are often unavailable, so evaluating the robustness of black-box ASR models is essential for a comprehensive understanding of ASR resilience. We therefore study the vulnerability of cutting-edge ASR models to practical black-box attacks and propose employing two advanced time-domain transferable attacks together with our differentiable feature extractor. We also propose a speech-aware gradient optimization approach (SAGO) for ASR, which forces mistranscription while keeping the perturbation as imperceptible to human listeners as possible, using a voice activity detection rule and a speech-aware gradient-oriented optimizer. Experiments with five models on two databases show performance improvements over baseline approaches.
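To make the idea of a voice-activity-guided perturbation concrete, the following is a minimal sketch in Python with NumPy. It is not the SAGO algorithm itself, whose details the abstract does not give. It assumes one plausible reading: a simple energy-based voice activity rule selects the speech samples, and a sign-gradient step perturbs only those samples within a small amplitude bound. The function names `energy_vad_mask` and `masked_sign_step`, the parameters `frame_len`, `threshold_db`, `step`, and `eps`, and the random stand-in gradient are all hypothetical; in a real attack the gradient would come from a surrogate ASR model's loss.

```python
import numpy as np


def energy_vad_mask(wave, frame_len=400, threshold_db=-35.0):
    """Return a per-sample 0/1 mask from a simple energy rule (assumed, not from the paper).

    A frame counts as speech when its mean energy is within `threshold_db`
    decibels of the loudest frame in the utterance.
    """
    n_frames = int(np.ceil(len(wave) / frame_len))
    padded = np.zeros(n_frames * frame_len, dtype=np.float64)
    padded[: len(wave)] = wave
    frames = padded.reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1) + 1e-12  # avoid log(0) on silence
    energy_db = 10.0 * np.log10(energy / energy.max())
    frame_mask = (energy_db > threshold_db).astype(np.float64)
    # Expand the frame-level decision back to sample resolution.
    return np.repeat(frame_mask, frame_len)[: len(wave)]


def masked_sign_step(adv, clean, grad, mask, step=1e-3, eps=5e-3):
    """One sign-gradient step applied only where mask == 1.

    The result is projected back into an L-infinity ball of radius `eps`
    around the clean waveform and clipped to the valid range [-1, 1].
    """
    adv = adv + step * np.sign(grad) * mask
    delta = np.clip(adv - clean, -eps, eps)
    return np.clip(clean + delta, -1.0, 1.0)


# Toy utterance at 16 kHz: 0.1 s silence, 0.1 s tone, 0.1 s silence.
sr = 16000
t = np.arange(sr // 10) / sr
clean = np.concatenate(
    [np.zeros(sr // 10), 0.5 * np.sin(2 * np.pi * 440.0 * t), np.zeros(sr // 10)]
)

mask = energy_vad_mask(clean)

# Stand-in for the gradient of a surrogate model's loss (hypothetical).
rng = np.random.default_rng(0)
grad = rng.standard_normal(clean.shape)

adv = masked_sign_step(clean.copy(), clean, grad, mask)
print("perturbed samples:", int(np.count_nonzero(adv - clean)))
```

Confining the update to the masked region leaves silent segments untouched, where added noise would be easiest to hear. Whether SAGO applies its voice activity detection rule in exactly this way is an assumption of this sketch.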