Direct speech-to-speech translation (S2ST) translates speech from one language into another using a single model. However, because of linguistic and acoustic diversity, the target speech follows a complex multimodal distribution, which makes it difficult for S2ST models to achieve both high-quality translation and fast decoding. In this paper, we propose DASpeech, a non-autoregressive direct S2ST model that realizes both fast and high-quality S2ST. To better capture the complex distribution of the target speech, DASpeech adopts a two-pass architecture that decomposes generation into two steps: a linguistic decoder first generates the target text, and an acoustic decoder then generates the target speech from the hidden states of the linguistic decoder. Specifically, we use the decoder of DA-Transformer as the linguistic decoder and FastSpeech 2 as the acoustic decoder. DA-Transformer models translations with a directed acyclic graph (DAG). To account for all potential paths in the DAG during training, we calculate the expected hidden state for each target token via dynamic programming and feed these states into the acoustic decoder to predict the target mel-spectrogram. During inference, we select the most probable path and take the hidden states on that path as input to the acoustic decoder. Experiments on the CVSS Fr-En benchmark demonstrate that DASpeech achieves performance comparable to or even better than the state-of-the-art S2ST model Translatotron 2, while delivering a speedup of up to 18.53x over the autoregressive baseline. Unlike the previous non-autoregressive S2ST model, DASpeech relies on neither knowledge distillation nor iterative decoding, and it achieves significant improvements in both translation quality and decoding speed. Furthermore, DASpeech is able to preserve the source speaker's voice during translation.
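The two dynamic-programming steps mentioned above (expected hidden states over all DAG paths during training, and most-probable-path selection at inference) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a DAG whose vertices are topologically ordered, a transition matrix `trans` that is nonzero only above the diagonal, paths that start at vertex 0 and end at the last vertex, and a hypothetical matrix `emit` holding the probability that each vertex emits each target token. The function names are invented for illustration, and a practical version would likely work in log space on batched tensors.

```python
import numpy as np


def most_probable_path(trans):
    """Max-product (Viterbi-style) search for the best path from vertex 0 to vertex L-1.

    trans[i, j] is the assumed probability of moving from vertex i to vertex j,
    nonzero only for j > i.
    """
    L = trans.shape[0]
    with np.errstate(divide="ignore"):
        log_t = np.log(trans)  # zero-probability edges become -inf
    best = np.full(L, -np.inf)  # best log-probability of reaching each vertex
    back = np.zeros(L, dtype=int)  # best predecessor of each vertex
    best[0] = 0.0
    for j in range(1, L):
        scores = best[:j] + log_t[:j, j]
        back[j] = int(np.argmax(scores))
        best[j] = scores[back[j]]
    # Follow back-pointers from the last vertex to the first.
    path = [L - 1]
    while path[-1] != 0:
        path.append(int(back[path[-1]]))
    return path[::-1]


def expected_hidden_states(trans, emit, hidden):
    """Posterior-weighted average of vertex hidden states for each target token.

    emit[j, i]: assumed probability that vertex j emits target token i (L x M).
    hidden[j]:  hidden state of vertex j (L x D).
    Returns an M x D array, one expected hidden state per target token.
    """
    L, M = emit.shape
    alpha = np.zeros((M, L))  # forward: prefix ends at vertex j at position i
    beta = np.zeros((M, L))   # backward: suffix continues from vertex j at position i
    alpha[0, 0] = emit[0, 0]
    for i in range(1, M):
        alpha[i] = (alpha[i - 1] @ trans) * emit[:, i]
    beta[M - 1, L - 1] = 1.0
    for i in range(M - 2, -1, -1):
        beta[i] = trans @ (emit[:, i + 1] * beta[i + 1])
    posterior = alpha * beta  # proportional to P(token i is aligned to vertex j)
    posterior /= posterior.sum(axis=1, keepdims=True)
    return posterior @ hidden
```

At inference time, under the same assumptions, the acoustic decoder input would be `hidden[most_probable_path(trans)]`, that is, the hidden states of the vertices on the selected path.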