Recently, diffusion models have achieved strong performance with a small dataset of size $n$ and a fast optimization process. However, the estimation error of diffusion models suffers from the curse of dimensionality, scaling as $n^{-1/D}$ in the data dimension $D$. Since images usually lie on a union of low-dimensional manifolds, existing works model the data as a union of linear subspaces with Gaussian latents and achieve a $1/\sqrt{n}$ bound. Although this modeling reflects the multi-manifold property, a Gaussian latent cannot capture the multi-modal property of the latent manifold. To bridge this gap, we propose the mixture of low-rank mixtures of Gaussians (MoLR-MoG) modeling, which represents the target data as a union of $K$ linear subspaces, where each subspace admits a mixture-of-Gaussians latent ($n_k$ modes of dimension $d_k$). Under this modeling, the corresponding score function naturally has a mixture-of-experts (MoE) structure: it captures multi-modal information and is inherently nonlinear. We first conduct real-world experiments showing that the generation results of the MoE-latent MoG NN are much better than those of the MoE-latent Gaussian score. Furthermore, the MoE-latent MoG NN achieves performance comparable to an MoE-latent Unet with $10\times$ the parameters. These results indicate that the MoLR-MoG modeling is reasonable and well suited to real-world data. Building on this MoE-latent MoG score, we establish an $R^4\sqrt{\sum_{k=1}^K n_k}\sqrt{\sum_{k=1}^K n_k d_k}/\sqrt{n}$ estimation error bound, which escapes the curse of dimensionality by exploiting the data structure. Finally, we study the optimization process and prove a convergence guarantee under the MoLR-MoG modeling. Taken together, in a setting close to real-world data, these results explain why diffusion models require only a small training sample and a fast optimization process to achieve strong performance.
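For concreteness, the following is a minimal sketch of the MoLR-MoG data model and its score under one natural reading of the abstract; the subspace bases $A_k \in \mathbb{R}^{D \times d_k}$, mixture weights $\pi_k$ and $\pi_{k,j}$, latent parameters $\mu_{k,j}$ and $\Sigma_{k,j}$, and the variance-preserving forward process are assumed notation rather than details fixed by the text. A sample is generated as
$$
x_0 = A_k z, \qquad z \sim \mathcal{N}(\mu_{k,j}, \Sigma_{k,j}), \qquad (k, j) \text{ drawn with probability } \pi_k \pi_{k,j},
$$
so the data lie on the union of the $K$ subspaces $\mathrm{span}(A_k)$. Under a variance-preserving forward process $x_t = \alpha_t x_0 + \sigma_t \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I_D)$, the noised marginal is a full-rank Gaussian mixture with means $\mu_{t,k,j} = \alpha_t A_k \mu_{k,j}$ and covariances $\Sigma_{t,k,j} = \alpha_t^2 A_k \Sigma_{k,j} A_k^\top + \sigma_t^2 I_D$, whose score is
$$
\nabla_x \log p_t(x) = \sum_{k=1}^{K} \sum_{j=1}^{n_k} w_{t,k,j}(x)\, \Sigma_{t,k,j}^{-1}\big(\mu_{t,k,j} - x\big),
\qquad
w_{t,k,j}(x) \propto \pi_k \pi_{k,j}\, \mathcal{N}\big(x;\, \mu_{t,k,j},\, \Sigma_{t,k,j}\big).
$$
This is a softmax-gated combination of $\sum_{k=1}^K n_k$ experts, each linear in $x$: the gating weights $w_{t,k,j}(x)$ carry the multi-modal information and supply the nonlinearity, which is the MoE structure referred to above.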