Adversarial perturbations in speech pose a serious threat to automatic speech recognition (ASR) and speaker verification by introducing subtle waveform modifications that remain imperceptible to humans but can significantly alter system outputs. While targeted attacks on end-to-end ASR models have been widely studied, the phonetic basis of these perturbations and their effect on speaker identity remain underexplored. In this work, we analyze adversarial audio at the phonetic level and show that perturbations exploit systematic confusions such as vowel centralization and consonant substitutions. These distortions not only mislead transcription but also degrade phonetic cues critical for speaker verification, leading to identity drift. Using DeepSpeech as our ASR target, we generate targeted adversarial examples and evaluate their impact on speaker embeddings across genuine and impostor samples. Results across 16 phonetically diverse target phrases demonstrate that adversarial audio induces both transcription errors and identity drift, highlighting the need for phonetic-aware defenses to ensure the robustness of ASR and speaker recognition systems.