Speech translation (ST) systems translate speech in one language into text in another language. End-to-end ST (e2e-ST) systems have gained popularity over cascade systems because they reduce latency and computational cost. Though resource-intensive, e2e-ST systems, unlike cascade systems, can inherently retain the paralinguistic and non-linguistic characteristics of speech. In this paper, we propose an e2e architecture for English-Hindi (en-hi) ST. We use two imperfect machine translation (MT) services to translate the Libri-trans en text into hi text. While the output of each service can be used on its own to generate parallel ST data, we propose a data augmentation strategy over the noisy MT data to aid robust ST. This strategy is the main contribution of the paper. We show that it yields better ST quality, measured by BLEU score, than brute-force augmentation of the MT data. We observe an absolute improvement of 1.59 BLEU points with our approach.