Diffusion models have achieved state-of-the-art synthesis quality on both visual and audio tasks, and recent works further adapt them to textual data by diffusing in the embedding space. In this paper, we conduct a systematic study and analyze the challenges that arise when moving from a continuous data space to the embedding space, which have not been carefully explored. First, because the embeddings are themselves learned, the data distribution is learnable, which may lead to the collapse of the loss function. Second, because the norm of embeddings varies between popular and rare words, adding the same noise scale to all of them leads to sub-optimal results. Third, we find that the normal level of noise causes insufficient training of the model. To address these challenges, we propose Difformer, an embedding diffusion model based on Transformer, with three essential components: an additional anchor loss function, a layer normalization module for embeddings, and a noise factor applied to the Gaussian noise. Experiments on two seminal text generation tasks, machine translation and text summarization, show the superiority of Difformer over the compared embedding diffusion baselines.
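To make the three components concrete, the following is a minimal NumPy sketch under stated assumptions, not the authors' implementation. It assumes the noise factor simply multiplies the Gaussian noise in a standard forward diffusion step, that layer normalization is applied per embedding vector, and that the anchor loss is a cross-entropy between the target tokens and the predicted clean embeddings. The function names, the toy embedding table, and the values `alpha_bar_t=0.9` and `noise_factor=4.0` are all hypothetical, and the noised embedding stands in for the output of a Transformer denoiser.

```python
import numpy as np

def layer_norm(emb, eps=1e-5):
    """Normalize each embedding vector to zero mean and unit variance."""
    mean = emb.mean(axis=-1, keepdims=True)
    var = emb.var(axis=-1, keepdims=True)
    return (emb - mean) / np.sqrt(var + eps)

def add_noise(z0, alpha_bar_t, noise_factor, rng):
    """Forward diffusion step; the Gaussian noise is scaled by noise_factor (assumed form)."""
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * noise_factor * eps

def anchor_loss(z0_hat, emb_table, target_ids):
    """Cross-entropy of the target tokens given predicted embeddings (assumed form)."""
    logits = z0_hat @ emb_table.T
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

rng = np.random.default_rng(0)
vocab_size, dim = 8, 16

# Toy embedding table whose row norms grow with the token id,
# mimicking the norm gap between popular and rare words.
scales = np.arange(1, vocab_size + 1)[:, None]
emb_table = rng.standard_normal((vocab_size, dim)) * scales

# After layer normalization every row has (almost exactly) norm sqrt(dim),
# so one noise scale affects all tokens comparably.
normed_table = layer_norm(emb_table)

target_ids = np.array([0, 3, 7])
z0 = normed_table[target_ids]
zt = add_noise(z0, alpha_bar_t=0.9, noise_factor=4.0, rng=rng)

# Placeholder: a real model would predict z0_hat from zt with a Transformer.
z0_hat = zt
loss = anchor_loss(z0_hat, normed_table, target_ids)
```

In this reading, the anchor loss ties the embedding table to the model's own predictions rather than to the clean embeddings alone, which is one plausible way to discourage the collapse described above.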