End-to-end automatic speech translation (AST) relies on data that combines audio inputs with text translation outputs. Previous work used existing large parallel corpora of transcriptions and translations in a knowledge distillation (KD) setup to distill a neural machine translation (NMT) model into an AST student model. While KD allows using larger pretrained models, previous KD approaches rely on manual audio transcripts in the data pipeline, which restricts the applicability of this framework to AST. We present an imitation learning approach in which a teacher NMT system corrects the errors of an AST student without relying on manual transcripts. We show that the NMT teacher can recover from errors in automatic transcriptions and can correct erroneous translations of the AST student, leading to improvements of about 4 BLEU points over the standard end-to-end AST baseline on both the English-German CoVoST-2 and MuST-C datasets. Code and data are publicly available.\footnote{\url{https://github.com/HubReb/imitkd_ast/releases/tag/v1.1}}
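As a rough illustration of the teacher-corrects-student idea, the sketch below computes a token-level distillation loss on a student-generated prefix. It is not the released implementation (see the repository linked in the footnote for that). The function names, the plain-list representation of logits, and the assumption that teacher and student share one target vocabulary are all hypothetical choices made for this example. The teacher is assumed to score each next token given an automatic transcript and the student's own prefix, while the student scores the same positions given the audio.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def imitation_kd_loss(student_logits, teacher_logits):
    """Mean per-position cross-entropy H(teacher, student).

    Each argument is a list with one entry per target position; every
    entry is a list of logits over the same (assumed shared) vocabulary.
    Hypothetical setup: both models score the next token after the same
    student-generated prefix, so the teacher's distribution acts as a
    correction of the student's own partial translation.
    """
    if not student_logits or len(student_logits) != len(teacher_logits):
        raise ValueError("need the same non-zero number of positions")
    total = 0.0
    for s_pos, t_pos in zip(student_logits, teacher_logits):
        if len(s_pos) != len(t_pos):
            raise ValueError("vocabulary sizes differ")
        log_p_student = log_softmax(s_pos)
        p_teacher = [math.exp(lp) for lp in log_softmax(t_pos)]
        # Cross-entropy between the teacher and student distributions.
        total -= sum(pt * ls for pt, ls in zip(p_teacher, log_p_student))
    return total / len(student_logits)


# Toy example: one target position, vocabulary of three tokens.
teacher = [[2.0, 0.0, 0.0]]
agreeing_student = [[2.0, 0.0, 0.0]]
disagreeing_student = [[0.0, 2.0, 0.0]]
loss_agree = imitation_kd_loss(agreeing_student, teacher)
loss_disagree = imitation_kd_loss(disagreeing_student, teacher)
```

In this toy setting the loss is smallest when the student matches the teacher at every position, so minimizing it pulls the student toward the teacher's corrections. In an actual training loop the logits would come from the AST and NMT models, and gradients would flow only through the student.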