While continuous diffusion has shown remarkable success in continuous domains such as image generation, its direct application to discrete data has underperformed compared to purely discrete formulations. This gap is counterintuitive, given that continuous diffusion learns score functions that enable joint evolution across multiple positions. To explain it, we introduce token identifiability as an analytical framework for how Gaussian noise corrupts discrete data through two mechanisms: discrete identity corruption and continuous rank degradation. We show that these mechanisms scale differently with vocabulary size, creating a temporal dissonance: at noise levels where discrete corruption preserves enough structure for conditional learning, continuous denoising is trivial; at noise levels where continuous denoising is meaningful, discrete corruption has destroyed nearly all conditional structure. To resolve this dissonance, we propose CANDI (Continuous ANd DIscrete diffusion), a hybrid framework that decouples discrete and continuous corruption, enabling simultaneous learning of both conditional structure and continuous geometry. We empirically validate the temporal dissonance phenomenon and demonstrate that CANDI successfully avoids it. This unlocks the benefits of continuous diffusion for discrete spaces: on controlled generation, CANDI enables classifier-based guidance with off-the-shelf classifiers through simple gradient addition; on text generation, CANDI outperforms masked diffusion at low NFE, demonstrating the value of learning continuous gradients for discrete spaces.
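The decoupled corruption at the heart of the hybrid framework can be illustrated with a toy sketch. This is not the paper's implementation: the random unit embeddings, the uniform-replacement discrete corruption, and the nearest-neighbour probe for token identifiability are all illustrative assumptions, chosen only to show how the two noise channels can be controlled independently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary of V tokens with random unit-norm embeddings
# (illustrative; real models learn these embeddings).
V, d = 50, 16
embeddings = rng.normal(size=(V, d))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def hybrid_corrupt(token_ids, p_discrete, sigma):
    """Decoupled corruption: with probability p_discrete, replace a
    token's identity uniformly at random (discrete channel); then add
    Gaussian noise of scale sigma to its embedding (continuous channel).
    The two noise levels can be set independently."""
    token_ids = np.asarray(token_ids)
    replace = rng.random(token_ids.shape) < p_discrete
    corrupted_ids = np.where(
        replace, rng.integers(0, V, token_ids.shape), token_ids
    )
    x = embeddings[corrupted_ids]
    return corrupted_ids, x + sigma * rng.normal(size=x.shape)

def identify(x_noisy):
    """Nearest-neighbour decoding: a crude probe of whether each noisy
    vector is still identifiable as its original token."""
    dists = np.linalg.norm(
        x_noisy[:, None, :] - embeddings[None, :, :], axis=-1
    )
    return dists.argmin(axis=-1)

ids = rng.integers(0, V, size=200)

# Mild continuous noise: token identities survive almost entirely.
_, x = hybrid_corrupt(ids, p_discrete=0.0, sigma=0.1)
acc_low = (identify(x) == ids).mean()

# Heavy continuous noise: nearest-neighbour identity is mostly lost,
# even with no discrete corruption at all.
_, x = hybrid_corrupt(ids, p_discrete=0.0, sigma=2.0)
acc_high = (identify(x) == ids).mean()
```

Because `p_discrete` and `sigma` are separate knobs, a training schedule can hold discrete corruption low (preserving conditional structure) while sweeping continuous noise through the regime where denoising is non-trivial, which is the decoupling the abstract describes.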