How can speech-to-text translation (ST) perform as well as machine translation (MT)? The key is to bridge the modality gap between speech and text so that useful MT techniques can be applied to ST. Recently, representing speech with unsupervised discrete units has offered a new way to ease the modality problem. This motivates us to propose Discrete Unit Back-translation (DUB) to answer two questions: (1) Is it better to represent speech with discrete units than with continuous features in direct ST? (2) How much benefit can useful MT techniques bring to ST? With DUB, back-translation can be successfully applied to direct ST, yielding an average gain of 5.5 BLEU on MuST-C En-De/Fr/Es. In the low-resource language scenario, our method achieves performance comparable to existing methods that rely on large-scale external data. Code and models are available at https://github.com/0nutation/DUB.