This paper describes CMU's submission to the IWSLT 2024 Simultaneous Speech Translation (SST) task for translating English speech into German text in a streaming manner. Our end-to-end speech-to-text (ST) system integrates the WavLM speech encoder, a modality adapter, and the Llama2-7B-Base model as the decoder. We employ a two-stage training approach: first we align the representations of speech and text, then we perform full fine-tuning. Both stages are trained on MuST-C v2 data with cross-entropy loss. We adapt our offline ST model for SST using a simple fixed hold-n policy. Experiments show that our model obtains an offline BLEU score of 31.1, and a BLEU score of 29.5 at a latency of 2 seconds, on the MuST-C v2 tst-COMMON set.
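The fixed hold-n policy mentioned above can be sketched as follows. This is an illustrative reading, not the paper's implementation: at each step the offline decoder re-translates the speech received so far, the last n tokens of the hypothesis are held back as unstable, and only new tokens extending the previously committed prefix are emitted. All names here are our own.

```python
def hold_n_commit(committed, hypothesis, n):
    """Sketch of a fixed hold-n commitment policy (hypothetical helper).

    committed:  tokens already emitted to the user (immutable prefix)
    hypothesis: the decoder's full hypothesis for the current partial input
    n:          number of trailing tokens to hold back as unstable
    Returns the updated committed prefix and the newly emitted tokens.
    """
    # Drop the last n tokens of the hypothesis; they may still change
    # as more source speech arrives.
    stable = hypothesis[: max(len(hypothesis) - n, 0)]
    # Emit only tokens that extend what was already committed.
    new_tokens = stable[len(committed):]
    return committed + new_tokens, new_tokens


# Simulated streaming steps: the hypothesis grows as speech arrives.
committed = []
for hyp in (["Das"], ["Das", "ist"], ["Das", "ist", "ein", "Test"]):
    committed, emitted = hold_n_commit(committed, hyp, n=2)
    print(committed, emitted)
```

With n = 2, nothing is emitted until the hypothesis is at least two tokens longer than the committed prefix, which trades a small amount of latency for stability of the streamed output.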