Subtitling plays a crucial role in enhancing the accessibility of audiovisual content and encompasses three primary subtasks: translating spoken dialogue, segmenting translations into concise textual units, and estimating timestamps that govern their on-screen duration. Past attempts to automate this process rely, to varying degrees, on automatic transcripts, which are used in different ways for the three subtasks. In response to the acknowledged limitations of this reliance on transcripts, recent research has shifted towards transcription-free solutions for translation and segmentation, leaving the direct generation of timestamps as uncharted territory. To fill this gap, we introduce the first direct model capable of producing automatic subtitles without any dependence on intermediate transcripts, including for timestamp prediction. Experimental results, backed by manual evaluation, show that our solution achieves new state-of-the-art performance across multiple language pairs and diverse conditions.