Recent work in speech-to-speech translation (S2ST) has focused primarily on offline settings, where the full input utterance is available before any output is produced. This assumption, however, is unrealistic in many real-world scenarios. In latency-sensitive applications, translations should be spoken as soon as the input contains enough information to produce them, rather than after the full utterance has been received. In this work, we introduce a system for simultaneous S2ST (SimulS2ST) targeting real-world use cases. Our system supports translation from 57 languages to English, with tunable parameters for dynamically adjusting output latency, including four policies for determining when to speak an output sequence. We show that these policies achieve offline-level accuracy with minimal increases in latency over a Greedy (wait-$k$) baseline. We open-source our evaluation code and interactive test script to aid future SimulS2ST research and application development.