Diffusion-based generative models have achieved unprecedented fidelity in synthesizing high-dimensional data, yet the theoretical mechanisms governing multimodal generation remain poorly understood. Here, we present a theoretical framework for coupled diffusion models, using coupled Ornstein-Uhlenbeck processes as a tractable model. Drawing on the nonequilibrium statistical physics of dynamical phase transitions, we demonstrate that multimodal generation is governed by a spectral hierarchy of interaction timescales rather than by simultaneous resolution of all modes. A key prediction is the ``synchronization gap'', a temporal window during the reverse generative process in which distinct eigenmodes stabilize at different rates, providing a theoretical explanation for common desynchronization artifacts. We derive analytical conditions for speciation and collapse times under both symmetric and anisotropic coupling regimes, establishing strict bounds on the coupling strength that avoid unstable symmetry breaking. We show that the coupling strength acts as a spectral filter that enforces a tunable temporal hierarchy on generation. We support these predictions through controlled experiments with diffusion models trained on MNIST datasets and with exact score samplers. These results motivate time-dependent coupling schedules that target mode-specific timescales, offering a potential alternative to ad hoc guidance tuning.
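To make the idea of a spectral hierarchy of timescales concrete, the following is a minimal sketch under stated assumptions, not the construction used in the paper. It assumes a hypothetical two-variable forward Ornstein-Uhlenbeck drift dX = -A X dt + noise with symmetric coupling matrix A = [[1, -g], [-g, 1]]; the function name `relaxation_times` and the parameter `g` are illustrative. In this toy model the common and difference eigenmodes relax at rates 1 - |g| and 1 + |g|, so the two timescales separate as |g| grows and the drift becomes unstable once |g| reaches 1. The sketch covers forward relaxation only and does not compute the speciation or collapse times of the reverse process.

```python
import numpy as np

def relaxation_times(g):
    """Eigenmode relaxation times for a toy symmetric coupled OU drift.

    Assumed (hypothetical) model: dX = -A X dt + noise,
    with A = [[1, -g], [-g, 1]]. Not taken from the paper.
    """
    if abs(g) >= 1.0:
        # One eigenvalue of A is non-positive: that mode no longer relaxes.
        raise ValueError("toy model requires |g| < 1 for a stable drift")
    A = np.array([[1.0, -g], [-g, 1.0]])
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order,
    # here 1 - |g| (slow mode) and 1 + |g| (fast mode).
    rates = np.linalg.eigvalsh(A)
    return 1.0 / rates  # relaxation time = 1 / decay rate

slow, fast = relaxation_times(0.5)
print(f"slow mode: {slow:.3f}, fast mode: {fast:.3f}, ratio: {slow / fast:.3f}")
```

In this toy setting the ratio of the two relaxation times is (1 + |g|) / (1 - |g|), which gives a rough sense of how a single coupling parameter can widen or narrow the gap between mode timescales.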