This work describes the participation of the MLLP-VRAIN research group in the shared task of the IWSLT 2025 Simultaneous Speech Translation track. Our submission addresses the challenges of real-time translation of long-form speech with a modular cascade system that adapts strong pre-trained models to streaming scenarios. We combine Whisper Large-V3-Turbo for ASR with the multilingual NLLB-3.3B model for MT, applying lightweight adaptation techniques rather than training new end-to-end models from scratch. Our approach employs document-level adaptation with prefix training to improve the MT model's handling of incomplete inputs, together with adaptive emission policies, a wait-$k$ strategy and RALCP, to manage the translation stream. Specialized buffer management and segmentation strategies ensure coherent translations across long audio sequences. Experimental results on the ACL60/60 dataset show that our system achieves a favorable balance between translation quality and latency, with a BLEU score of 31.96 and a non computation-aware StreamLAAL latency of 2.94 seconds. Our final model achieves a preliminary score of 29.8 BLEU on the official test set (IWSLT25Instruct). These results demonstrate that carefully adapted pre-trained components can form effective simultaneous translation systems for long-form content without requiring extensive in-domain parallel data or specialized end-to-end training.
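The two emission policies named in the abstract can be illustrated with minimal sketches. The code below is hypothetical and not the authors' implementation: `wait_k_schedule` produces the read/write action sequence of a standard wait-$k$ policy (read $k$ source tokens, then alternate), and `ralcp_prefix` shows the core idea of RALCP, committing the longest target prefix on which at least a fraction `gamma` of beam hypotheses agree; the function names and the `gamma` parameter are illustrative assumptions.

```python
from collections import Counter


def wait_k_schedule(num_source, num_target, k):
    """Yield the READ/WRITE action sequence of a wait-k policy.

    Hypothetical sketch: read k source tokens before the first write,
    then alternate one read per write until the source is exhausted.
    """
    read, written = 0, 0
    while written < num_target:
        if read < min(written + k, num_source):
            read += 1
            yield "READ"
        else:
            written += 1
            yield "WRITE"


def ralcp_prefix(beam_hyps, gamma=0.6):
    """Commit the longest prefix agreed on by >= gamma of the beam.

    Illustrative RALCP sketch: at each target position, take the most
    frequent token across beam hypotheses and commit it only if its
    relative frequency reaches the agreement threshold gamma.
    """
    committed = []
    for position in zip(*beam_hyps):  # stops at the shortest hypothesis
        token, count = Counter(position).most_common(1)[0]
        if count / len(beam_hyps) < gamma:
            break  # agreement too weak: wait for more source context
        committed.append(token)
    return committed
```

For example, `wait_k_schedule(3, 3, 2)` yields `READ, READ, WRITE, READ, WRITE, WRITE`: translation starts after two source tokens and the last target tokens are flushed once the source ends. Raising `gamma` in `ralcp_prefix` makes the policy more conservative, trading latency for stability of the committed output.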