Speech-to-speech translation is a typical sequence-to-sequence learning task that naturally has two directions. How can bidirectional supervision signals be leveraged effectively to produce high-fidelity audio in both directions? Existing approaches, which train either two separate models or a single multitask model, suffer from low efficiency and inferior performance. In this paper, we propose a duplex diffusion model that applies diffusion probabilistic models to both sides of a reversible duplex Conformer, so that each end can simultaneously take in and emit speech in its own distinct language. Our model enables reversible speech translation by simply flipping the input and output ends. Experiments show that our model achieves the first success in reversible speech translation, with significant improvements in ASR-BLEU scores over a list of state-of-the-art baselines.