Training end-to-end speech translation (ST) systems requires large-scale data, which is unavailable for most language pairs and domains. One practical solution to this data scarcity is to convert machine translation (MT) data to ST data via text-to-speech (TTS) systems. Yet using TTS systems can be tedious and slow, as the conversion must be repeated for each MT dataset. In this work, we propose SpokenVocab, a simple, scalable and effective data augmentation technique that converts MT data to ST data on the fly. The idea is to retrieve audio snippets from a SpokenVocab bank according to the words in an MT sequence and stitch them together. Our experiments on multiple language pairs from MuST-C show that this method outperforms strong baselines by an average of 1.83 BLEU points, and that it performs as well as TTS-generated speech. We also showcase how SpokenVocab can be applied to code-switching ST, for which TTS systems often do not exist. Our code is available at https://github.com/mingzi151/SpokenVocab
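
The retrieve-and-stitch idea can be illustrated with a minimal sketch. It is not taken from the released code: the function name `stitch_words`, the representation of a snippet as a list of float samples, the whitespace tokenization, the fallback snippet `unk` for words missing from the bank, and the optional silent `gap` between words are all assumptions made for illustration.

```python
from typing import Dict, List

Waveform = List[float]  # assumed representation: raw audio samples


def stitch_words(
    sentence: str,
    bank: Dict[str, Waveform],
    unk: Waveform,
    gap: int = 0,
) -> Waveform:
    """Build a synthetic utterance by concatenating per-word snippets.

    `bank` is a hypothetical word -> snippet lookup table, `unk` is the
    snippet used for words missing from the bank, and `gap` is the number
    of zero-valued (silent) samples inserted between consecutive words.
    """
    words = sentence.lower().split()  # assumed: lowercased whitespace tokens
    stitched: Waveform = []
    for i, word in enumerate(words):
        # Retrieve the stored snippet, falling back to `unk` if absent.
        stitched.extend(bank.get(word, unk))
        # Add silence between words, but not after the last one.
        if gap and i < len(words) - 1:
            stitched.extend([0.0] * gap)
    return stitched


# Toy bank with made-up sample values standing in for real audio.
toy_bank = {"hello": [0.1, 0.2], "world": [0.3]}
toy_unk = [0.0]
utterance = stitch_words("Hello world", toy_bank, toy_unk, gap=2)
```

Because the lookup is a dictionary access followed by a concatenation, a sketch like this can run inside a data loader for every MT sentence, which is what makes on-the-fly conversion plausible without a separate TTS pass per dataset.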