Automatic subtitling is the task of automatically translating the speech of audiovisual content into short pieces of timed text, i.e., subtitles and their corresponding timestamps. The generated subtitles need to conform to space and time requirements, while being synchronised with the speech and segmented in a way that facilitates comprehension. Given its considerable complexity, the task has so far been addressed through a pipeline of components that separately deal with transcribing, translating, and segmenting text into subtitles, as well as predicting timestamps. In this paper, we propose the first direct speech translation (ST) model for automatic subtitling, which generates subtitles in the target language along with their timestamps. Our experiments on 7 language pairs show that our approach outperforms a cascade system in the same data condition and is competitive with production tools on both in-domain and newly released out-of-domain benchmarks covering new scenarios.
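To make the notion of "space and time requirements" concrete, the following is a minimal sketch of a conformity check for a single subtitle. The numeric limits, the `Subtitle` class, and the `conforms` function are hypothetical and chosen for illustration only; the text above does not state concrete values, and subtitling guidelines differ across providers.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical limits for illustration only; the abstract does not give
# concrete values, and real subtitling guidelines vary.
MAX_CHARS_PER_LINE = 42       # space constraint: line length
MAX_LINES = 2                 # space constraint: lines per subtitle
MAX_CHARS_PER_SECOND = 21.0   # time constraint: reading speed


@dataclass
class Subtitle:
    start: float      # display start time, in seconds
    end: float        # display end time, in seconds
    lines: List[str]  # text lines shown on screen together


def conforms(sub: Subtitle) -> bool:
    """Return True if the subtitle satisfies the assumed space and time limits."""
    duration = sub.end - sub.start
    if duration <= 0:
        return False
    if len(sub.lines) > MAX_LINES:
        return False
    if any(len(line) > MAX_CHARS_PER_LINE for line in sub.lines):
        return False
    # Reading speed: total characters divided by on-screen duration.
    n_chars = sum(len(line) for line in sub.lines)
    return n_chars / duration <= MAX_CHARS_PER_SECOND
```

Under these assumed limits, a two-line subtitle of 24 characters shown for two seconds passes, while the same text shown for half a second would exceed the reading-speed limit.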