While diffusion models have achieved great success in generating continuous signals such as images and audio, learning discrete sequence data such as natural language remains elusive for them. Although recent advances circumvent the challenge of discreteness by embedding discrete tokens as continuous surrogates, they still fall short of satisfactory generation quality. To understand this, we first examine the denoising training protocol of diffusion-based sequence generative models and identify three severe problems: 1) failing to learn, 2) lack of scalability, and 3) neglecting source conditions. We argue that these problems stem from the discreteness that is not completely eliminated in the embedding space, and that the scale of the noise is decisive here. In this paper, we introduce DINOISER to facilitate diffusion models for sequence generation by manipulating noise. We propose to adaptively determine the range of sampled noise scales for counter-discreteness training, and to encourage the proposed diffused sequence learner to leverage source conditions with amplified noise scales during inference. Experiments show that DINOISER consistently improves over previous diffusion-based sequence generative models on several conditional sequence modeling benchmarks, thanks to both its effective training and inference strategies. Analyses further verify that DINOISER makes better use of source conditions to govern its generative process.
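The claim that discreteness survives in the embedding space at small noise scales can be illustrated with a toy experiment. The sketch below is not the DINOISER method or its training code. It is a minimal illustration under assumed settings (random unit-norm embeddings, isotropic Gaussian noise, nearest-neighbor decoding, and the hypothetical helper `nearest_token_accuracy`) of how a lightly corrupted token embedding can still be mapped back to its token without any learning, whereas larger noise scales make recovery genuinely hard.

```python
import math
import random


def nearest_token_accuracy(sigma, vocab_size=50, dim=16, n_samples=500, seed=0):
    """Fraction of noisy embeddings whose nearest clean embedding is the true token.

    All settings here (random embeddings, vocabulary size, dimension) are
    illustrative assumptions, not values taken from the paper.
    """
    rng = random.Random(seed)

    # Build a toy embedding table of unit-norm random vectors.
    emb = []
    for _ in range(vocab_size):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        emb.append([x / norm for x in v])

    correct = 0
    for _ in range(n_samples):
        token = rng.randrange(vocab_size)
        # Corrupt the token embedding with Gaussian noise of scale sigma.
        noisy = [x + sigma * rng.gauss(0.0, 1.0) for x in emb[token]]
        # Decode by picking the closest clean embedding (squared L2 distance).
        nearest = min(
            range(vocab_size),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(noisy, emb[j])),
        )
        correct += nearest == token
    return correct / n_samples


if __name__ == "__main__":
    for sigma in (0.01, 0.1, 0.5, 1.0, 5.0):
        print(f"sigma={sigma:<5} accuracy={nearest_token_accuracy(sigma):.3f}")
```

In this toy setting, recovery is trivial when the noise is small relative to the spacing between embeddings, so a denoiser trained mostly at such scales would receive little learning signal. This is one plausible reading of why the range of sampled noise scales matters.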