In Simultaneous Machine Translation (SiMT), training on a simultaneous interpretation (SI) corpus is an effective way to build high-quality yet low-latency systems. However, curating such a corpus is very challenging due to limitations in annotator ability, and hence existing SI corpora are scarce. We therefore propose a method that uses Large Language Models to convert existing speech translation corpora into interpretation-style data that maintains the original word order while preserving the entire source content, yielding the LLM-SI-Corpus. We demonstrate that fine-tuning SiMT models on the LLM-SI-Corpus, in both text-to-text and speech-to-text settings, reduces latency while maintaining the same level of quality as models trained on offline datasets. The LLM-SI-Corpus is available at \url{https://github.com/yusuke1997/LLM-SI-Corpus}.