Speech segmentation is an essential part of speech translation (ST) systems in real-world scenarios. Since most ST models are designed to process speech segments, long-form audio must be partitioned into shorter segments before translation. Recently, data-driven approaches to the speech segmentation task have been developed. Although these approaches improve overall translation quality, a performance gap remains due to a mismatch between the segmentation models and the ST systems. In addition, prior work requires large self-supervised speech models, which consume significant computational resources. In this work, we propose a segmentation model that achieves better speech translation quality with a small model size. We propose an ASR-with-punctuation task as an effective pre-training strategy for the segmentation model. We also show that proper integration of the speech segmentation model into the underlying ST system is critical for improving overall translation quality at inference time.