This paper presents CMU's submission to the IWSLT 2025 Simultaneous Speech Translation (SST) task, which requires translating unsegmented English speech into Chinese and German text in a streaming manner. Our end-to-end speech-to-text system integrates a chunkwise causal Wav2Vec 2.0 speech encoder, an adapter, and Qwen2.5-7B-Instruct as the decoder. We train it with a two-stage simultaneous training procedure on robust speech segments curated from the LibriSpeech, CommonVoice, and VoxPopuli datasets, using a standard cross-entropy loss. The model supports adjustable latency through a configurable latency multiplier. On the ACL60/60 development set, our system achieves 44.3 BLEU for English-to-Chinese and 25.1 BLEU for English-to-German translation, with computation-aware latencies of 2.7 and 2.3 seconds and theoretical latencies of 2.2 and 1.7 seconds, respectively.
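The abstract does not define how the latency multiplier operates. As an illustration only, the sketch below assumes one plausible reading: the decoder is invoked once every `latency_multiplier` base speech chunks, so a larger multiplier trades latency for more context per decoding step. The function name `decode_times_ms`, the parameter `base_chunk_ms`, and the end-of-stream flush are hypothetical and are not taken from the system description.

```python
from typing import List


def decode_times_ms(audio_ms: int, base_chunk_ms: int, latency_multiplier: int) -> List[int]:
    """Return the points (in ms of received audio) at which the decoder is invoked.

    Assumption (not from the paper): the decoder runs once every
    `latency_multiplier` base chunks, plus once at end of stream to
    flush any remaining audio.
    """
    if base_chunk_ms <= 0 or latency_multiplier < 1:
        raise ValueError("base_chunk_ms must be > 0 and latency_multiplier >= 1")

    # Effective amount of new audio consumed before each decoding step.
    step = base_chunk_ms * latency_multiplier
    times = list(range(step, audio_ms + 1, step))

    # Flush: decode once more if the stream ends between two scheduled steps.
    if audio_ms > 0 and (not times or times[-1] != audio_ms):
        times.append(audio_ms)
    return times


if __name__ == "__main__":
    # Hypothetical 1000 ms base chunk; the multiplier controls how often we decode.
    for m in (1, 2, 4):
        print(m, decode_times_ms(5000, 1000, m))
```

Under this assumed policy, doubling the multiplier halves the number of decoding steps for the same audio, which is the kind of quality-versus-latency knob the abstract describes.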