Simultaneous translation models play a crucial role in facilitating communication. However, existing research primarily focuses on text-to-text or speech-to-text models, necessitating additional cascade components to achieve speech-to-speech translation. These pipeline methods suffer from error propagation and accumulate delays in each cascade component, resulting in reduced synchronization between the speaker and listener. To overcome these challenges, we propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X), which integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework. We develop a non-autoregressive decoder capable of concurrently generating multiple text or acoustic unit tokens upon receiving fixed-length speech chunks. The decoder can generate blank or repeated tokens and employs CTC decoding to dynamically adjust its latency. Experimental results show that NAST-S2X outperforms state-of-the-art models in both speech-to-text and speech-to-speech tasks. It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28× decoding speedup in offline generation.
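To illustrate the latency mechanism described above, the following is a minimal sketch (not the paper's implementation) of the standard CTC collapse rule: the decoder may emit blank or repeated tokens for a speech chunk, and CTC decoding merges consecutive repeats and drops blanks, so a chunk of mostly blanks effectively defers output until more source speech arrives. The token values and chunk contents are hypothetical.

```python
BLANK = "<blank>"

def ctc_collapse(tokens, blank=BLANK):
    """Standard CTC rule: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for t in tokens:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

# Hypothetical per-chunk decoder outputs: emitting mostly blanks for a
# chunk delays translation output (waiting for more source speech).
chunk1 = [BLANK, BLANK, "hello", "hello"]   # little content committed yet
chunk2 = ["hello", BLANK, "world", BLANK]   # more content arrives

print(ctc_collapse(chunk1 + chunk2))  # -> ['hello', 'world']
```

Note that repeats spanning a chunk boundary also collapse, so the decoder can safely re-emit the last token when continuing into a new chunk.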