The success of end-to-end speech-to-text translation (ST) often depends on source transcripts, e.g., through pre-training on automatic speech recognition (ASR) and machine translation (MT) tasks, or through additional ASR and MT data. Unfortunately, transcripts are not always available, since many of the world's languages are unwritten. In this paper, we aim to utilize large amounts of target-side monolingual data to enhance ST without transcripts. Motivated by the remarkable success of back translation in MT, we develop a back translation algorithm for ST (BT4ST) that synthesizes pseudo ST data from monolingual target data. To ease the challenges posed by short-to-long generation and one-to-many mapping, we introduce self-supervised discrete units and perform back translation by cascading a target-to-unit model and a unit-to-speech model. With our synthetic ST data, we achieve an average improvement of 2.3 BLEU on the MuST-C En-De, En-Fr, and En-Es datasets. Further experiments show that our method is especially effective in low-resource scenarios.
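The cascade described in the abstract (target text to discrete units, then units to speech) can be illustrated with a minimal sketch. Everything named here is hypothetical: `back_translate`, `SyntheticSTPair`, and the two toy stand-in models are illustrative assumptions, not the authors' implementation. In practice the two stages would be trained neural models, and the units would come from a self-supervised speech encoder.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

Units = List[int]        # sequence of discrete unit ids
Waveform = List[float]   # toy stand-in for audio samples


@dataclass
class SyntheticSTPair:
    """One pseudo ST example: synthesized source speech plus target text."""
    speech: Waveform
    translation: str


def back_translate(
    target_sentences: Iterable[str],
    target_to_unit: Callable[[str], Units],
    unit_to_speech: Callable[[Units], Waveform],
) -> List[SyntheticSTPair]:
    """Cascade the two stages over monolingual target-side text."""
    pairs = []
    for sentence in target_sentences:
        units = target_to_unit(sentence)   # stage 1: target text -> units
        speech = unit_to_speech(units)     # stage 2: units -> source speech
        pairs.append(SyntheticSTPair(speech=speech, translation=sentence))
    return pairs


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_target_to_unit(sentence: str) -> Units:
    # Map each non-space character to a fake unit id in [0, 100).
    return [ord(ch) % 100 for ch in sentence if not ch.isspace()]


def toy_unit_to_speech(units: Units, samples_per_unit: int = 4) -> Waveform:
    # Repeat a scaled value per unit to mimic a longer speech sequence.
    return [u / 100.0 for u in units for _ in range(samples_per_unit)]


pairs = back_translate(["Guten Tag"], toy_target_to_unit, toy_unit_to_speech)
```

The resulting pairs would be mixed with the real ST training data. Routing through an intermediate unit sequence splits the text-to-speech mapping into two shorter steps, which is the motivation the abstract gives for the cascade.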