DiffS2UT: A Semantic Preserving Diffusion Model for Textless Direct Speech-to-Speech Translation

Although diffusion generative models have achieved great success in image generation, incorporating them efficiently and effectively into speech generation, and into speech translation in particular, remains a non-trivial problem. Because speech data has low information density, the discrete speech unit sequence obtained from it is much longer than the corresponding text transcription, which poses significant challenges for existing auto-regressive models. Moreover, naively applying discrete diffusion to the speech unit sequence while disregarding the structure of the continuous space is suboptimal and significantly degrades generation performance. In this paper, we propose a novel diffusion model that applies the forward diffusion process in the continuous speech representation space and the backward diffusion process in the discrete speech unit space. In this way, we preserve the semantic structure of the continuous speech representation space during diffusion and integrate the continuous and discrete diffusion models. We conduct extensive experiments on textless direct speech-to-speech translation, where the proposed method achieves results comparable to the computationally intensive auto-regressive baselines (500 decoding steps on average) with significantly fewer decoding steps (50).
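The split between a continuous forward process and a discrete backward process can be illustrated with a toy sketch. This is not the paper's model: the codebook, the noise level `alpha_bar`, and every function name below are hypothetical, and a nearest-neighbour lookup stands in for the learned denoising network, whose architecture and training objective the abstract does not describe. The sketch only shows the data flow: discrete units are embedded, Gaussian noise is added in the continuous space, and the result is mapped back to discrete units.

```python
import math
import random


def make_codebook(num_units, dim, rng):
    """Hypothetical codebook: one continuous vector per discrete speech unit."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_units)]


def forward_noise(x0, alpha_bar, rng):
    """Continuous forward step: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]


def nearest_unit(x, codebook):
    """Stand-in for a learned denoiser: return the index of the closest codebook vector."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def corrupt_and_decode(units, codebook, alpha_bar, rng):
    """Noise each unit's embedding in continuous space, then decode back to discrete units."""
    noisy = [forward_noise(codebook[u], alpha_bar, rng) for u in units]
    return [nearest_unit(x, codebook) for x in noisy]


if __name__ == "__main__":
    rng = random.Random(0)
    codebook = make_codebook(num_units=8, dim=4, rng=rng)
    units = [3, 3, 1, 7, 0]
    # Lower alpha_bar means more noise, so more units are likely to be decoded incorrectly.
    for alpha_bar in (1.0, 0.9, 0.1):
        print(alpha_bar, corrupt_and_decode(units, codebook, alpha_bar, rng))
```

With `alpha_bar = 1.0` no noise is added and the original units are recovered exactly. As `alpha_bar` decreases, a unit tends to be confused first with units whose embeddings lie nearby, which is the kind of continuous-space structure the abstract says a purely discrete corruption process would disregard.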

