Current simultaneous speech translation models can process only audio segments up to a few seconds long. Contemporary datasets provide an oracle sentence segmentation based on human-annotated transcripts and translations. However, such sentence segmentation is not available in real-world use. Current speech segmentation approaches either offer poor segmentation quality or trade latency for quality. In this paper, we propose a novel segmentation approach for low-latency end-to-end speech translation. We leverage the existing speech translation encoder-decoder architecture with ST CTC and show that it can perform the segmentation task without supervision or additional parameters. To the best of our knowledge, our method is the first to enable truly end-to-end simultaneous speech translation, as the same model performs translation and segmentation at the same time. On a diverse set of language pairs and on both in-domain and out-of-domain data, we show that the proposed approach achieves state-of-the-art quality at no additional computational cost.