End-to-end Speech Translation (ST) aims to convert speech into target-language text with a single unified model. The inherent differences between the speech and text modalities often impede effective cross-modal and cross-lingual transfer. Existing methods typically employ hard alignment (H-Align) of individual speech and text segments, which can degrade textual representations. To avoid this degradation, we introduce Soft Alignment (S-Align), which uses adversarial training to align the representation spaces of both modalities. S-Align creates a modality-invariant space while preserving the quality of each individual modality. Experiments on three languages from the MuST-C dataset show that S-Align outperforms H-Align across multiple tasks and offers translation capabilities on par with specialized translation models.
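The adversarial alignment idea can be illustrated with a toy sketch. The abstract does not specify the actual architecture or losses, so everything below is an assumption made for illustration: a logistic-regression discriminator tries to tell speech representations (label 1) from text representations (label 0), and a hypothetical "speech encoder", reduced here to a single learned offset `shift`, is updated to increase the discriminator's loss. The text representations are never modified, which loosely mirrors the goal of aligning the two spaces without degrading the textual side. All function names are illustrative, not taken from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def discriminator_prob(w, b, x):
    """Probability that representation x comes from the speech modality."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def discriminator_step(w, b, speech, text, lr):
    """One gradient-descent step on the binary cross-entropy loss
    of the modality discriminator (speech = 1, text = 0)."""
    samples = [(x, 1.0) for x in speech] + [(x, 0.0) for x in text]
    n = len(samples)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, label in samples:
        err = discriminator_prob(w, b, x) - label  # dLoss/dlogit
        for i, xi in enumerate(x):
            grad_w[i] += err * xi / n
        grad_b += err / n
    new_w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
    return new_w, b - lr * grad_b


def encoder_step(shift, w, b, speech, lr):
    """One adversarial step for the (hypothetical) speech encoder.

    The encoder here is only an offset added to each speech vector.
    It ascends the discriminator's loss, i.e. the gradient sign is reversed.
    """
    grad = [0.0] * len(shift)
    for x in speech:
        h = [xi + si for xi, si in zip(x, shift)]
        err = discriminator_prob(w, b, h) - 1.0  # dLoss/dlogit for label 1
        for i, wi in enumerate(w):
            grad[i] += err * wi / len(speech)  # dLoss/dshift
    # Gradient ascent: move the speech representations so that the
    # discriminator becomes less able to recognise them as speech.
    return [si + lr * gi for si, gi in zip(shift, grad)]


def align(speech, text, steps=50, lr=0.5):
    """Alternate discriminator and encoder updates; return the learned offset."""
    dim = len(speech[0])
    w, b = [0.0] * dim, 0.0
    shift = [0.0] * dim
    for _ in range(steps):
        shifted = [[xi + si for xi, si in zip(x, shift)] for x in speech]
        w, b = discriminator_step(w, b, shifted, text, lr)
        shift = encoder_step(shift, w, b, speech, lr)
    return shift
```

Calling `align([[2.0, 2.0], [2.5, 1.5]], [[0.0, 0.0], [0.5, -0.5]])` returns the offset learned for the speech vectors. Because this is a min-max game, the offset may oscillate rather than converge, so the sketch should be read as an illustration of the update directions only, not as a reproduction of S-Align.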