End-to-end simultaneous speech translation (SimulST), also known as streaming speech translation, generates the translation while still receiving the streaming speech input, and therefore must segment the input and translate from the speech received so far. Segmenting at unfavorable moments, however, can break acoustic integrity and degrade the performance of the translation model. Learning to segment the speech input at moments that help the translation model produce high-quality translation is therefore key to SimulST. Existing SimulST methods, whether they use fixed-length segmentation or an external segmentation model, always separate segmentation from the underlying translation model, and this gap yields segmentation outcomes that are not necessarily beneficial for translation. In this paper, we propose Differentiable Segmentation (DiSeg) for SimulST, which learns segmentation directly from the underlying translation model. DiSeg makes hard segmentation differentiable through the proposed expectation training, so that segmentation can be trained jointly with the translation model and thereby become translation-beneficial. Experimental results demonstrate that DiSeg achieves state-of-the-art performance and exhibits superior segmentation capability.
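The general idea of replacing a hard segmentation decision with its expectation can be illustrated with a minimal sketch. This is not DiSeg's actual formulation, which the abstract does not spell out. It assumes, purely for illustration, that each speech frame carries an independent boundary probability, that inference thresholds this probability at 0.5, and that training uses the expected number of boundaries before each frame as a soft segment index. All function names and the threshold are hypothetical.

```python
def hard_boundaries(probs, threshold=0.5):
    """Hard (non-differentiable) decisions: 1 marks a boundary after the frame.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return [1 if p >= threshold else 0 for p in probs]


def segment_ids(boundaries):
    """Segment index of each frame = number of boundaries strictly before it."""
    ids, count = [], 0
    for b in boundaries:
        ids.append(count)
        count += b
    return ids


def expected_segment_ids(probs):
    """Soft counterpart of segment_ids.

    If each boundary is an independent Bernoulli(p) variable, the expected
    number of boundaries before a frame is the running sum of the earlier
    probabilities. This is a smooth function of probs, so gradients from a
    downstream loss could reach the segmentation probabilities.
    """
    ids, total = [], 0.0
    for p in probs:
        ids.append(total)
        total += p
    return ids


if __name__ == "__main__":
    probs = [0.25, 1.0, 0.0, 0.75]  # hypothetical per-frame boundary probabilities
    hard = hard_boundaries(probs)
    print(hard)                         # [0, 1, 0, 1]
    print(segment_ids(hard))            # [0, 0, 1, 1]
    print(expected_segment_ids(probs))  # [0.0, 0.25, 1.25, 1.25]
```

In this sketch the hard path yields integer segment indices that cannot be differentiated with respect to the probabilities, while the expected path yields real-valued indices that vary smoothly with them, which is the property that joint training with a translation model would rely on.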