Continuous diffusion for categorical data is a framework in the diffusion family that aims to generate discrete data. Interest in such models has grown steadily as researchers pursue the challenging goal of finding reasonable alternatives to autoregressive large language models. In this paper, we study the structure of the latent space corresponding to discrete tokens, characterizing it in terms of the Kullback-Leibler divergence on diffusion path measures and the accuracy with which an optimally trained diffusion model predicts the correct token. We find, through rigorous theoretical analysis and numerical experiments, that the FSQ tokenization scheme yields a latent space structure whose properties make it best suited for continuous diffusion for categorical data. To validate our findings in a real-life scenario, we train several text-to-speech diffusion models that use speech tokens as intermediate acoustic features and show that the one based on FSQ tokens indeed performs best; moreover, it outperforms its strong LLM-based counterpart while being significantly smaller and faster.