While diffusion models have achieved great success in generating continuous signals such as images and audio, learning discrete sequence data such as natural language remains elusive for them. Although recent advances circumvent the challenge of discreteness by embedding discrete tokens as continuous surrogates, they still fall short of satisfactory generation quality. To understand this, we first examine the denoising training protocol of diffusion-based sequence generative models and identify three severe problems: 1) failing to learn, 2) lack of scalability, and 3) neglecting source conditions. We argue that these problems boil down to a single pitfall: discreteness is not completely eliminated in the embedding space, and the scale of the noise is decisive here. In this paper, we introduce DINOISER to facilitate diffusion models for sequence generation by manipulating noise. We propose to adaptively determine the range of noise scales sampled during training so as to counter discreteness, and to encourage the proposed diffused sequence learner to leverage source conditions by using amplified noise scales during inference. Experiments show that DINOISER consistently improves over previous diffusion-based sequence generative models on several conditional sequence modeling benchmarks, thanks to both its training and inference strategies. Analyses further verify that DINOISER makes better use of source conditions to govern its generative process.