End-to-End Speech Translation (E2E-ST) has seen significant advances, yet current models are benchmarked primarily on curated, "clean" datasets. This overlooks critical real-world challenges such as morphological robustness to the inflectional variations common in non-native or dialectal speech. In this work, we adapt a text-based adversarial attack targeting inflectional morphology to the speech domain and demonstrate that state-of-the-art E2E-ST models are highly vulnerable to it. While adversarial training effectively mitigates such risks in text-based tasks, generating high-quality adversarial speech data remains computationally expensive and technically challenging. To address this, we propose Cross-Modal Robustness Transfer (CMRT), a framework that transfers adversarial robustness from the text modality to the speech modality, eliminating the need for adversarial speech data during training. Extensive experiments across four language pairs demonstrate that CMRT improves adversarial robustness by an average of more than 3 BLEU points, establishing a new baseline for robust E2E-ST without the overhead of generating adversarial speech.