We study the asymptotic error of sampling with score-based diffusion models in the large-sample regime from a nonparametric statistics perspective. We show that a kernel-based score estimator achieves an optimal mean squared error of $\widetilde{O}\left(n^{-1} t^{-\frac{d+2}{2}}(t^{\frac{d}{2}} \vee 1)\right)$ for the score function of $p_0*\mathcal{N}(0,t\boldsymbol{I}_d)$, where $n$ and $d$ denote the sample size and the dimension, respectively, $t$ is bounded above and below by polynomials in $n$, and $p_0$ is an arbitrary sub-Gaussian distribution. Consequently, we obtain an $\widetilde{O}\left(n^{-1/2} t^{-\frac{d}{4}}\right)$ upper bound on the total variation error of the distribution of the samples generated by the diffusion model, under only a sub-Gaussian assumption. If, in addition, $p_0$ belongs to the nonparametric family of the $\beta$-Sobolev space with $\beta\le 2$, then with an early stopping strategy the diffusion model is nearly (up to logarithmic factors) minimax optimal. This removes the crucial lower-bound assumption on $p_0$ required in previous proofs of the minimax optimality of diffusion models for nonparametric families.
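To make the object being estimated concrete, the following is a minimal sketch of one simple kernel-type score estimator: the exact score of the Gaussian-smoothed empirical measure $\frac{1}{n}\sum_{i}\mathcal{N}(X_i, t\boldsymbol{I}_d)$. This is an illustrative assumption, not necessarily the estimator analyzed in the paper, which may include truncation or regularization steps not stated in the abstract. The function name `smoothed_empirical_score` and the test setup ($p_0 = \mathcal{N}(0, \boldsymbol{I}_d)$, for which the true score of $p_0*\mathcal{N}(0,t\boldsymbol{I}_d)$ is $-x/(1+t)$) are hypothetical choices made for this example.

```python
import numpy as np


def smoothed_empirical_score(x, samples, t):
    """Score of the Gaussian-smoothed empirical measure at query points.

    Illustrative plug-in estimator (an assumption, not the paper's exact
    construction): grad log of (1/n) * sum_i N(x; X_i, t * I_d).

    x:       array of shape (m, d) or (d,), query points
    samples: array of shape (n, d), i.i.d. draws from p_0
    t:       positive noise level (variance of the Gaussian kernel)
    """
    x = np.atleast_2d(x)                                # (m, d)
    diffs = samples[None, :, :] - x[:, None, :]         # (m, n, d), X_i - x
    logits = -np.sum(diffs ** 2, axis=-1) / (2.0 * t)   # log kernel weights
    logits -= logits.max(axis=1, keepdims=True)         # stabilize the softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)       # (m, n), rows sum to 1
    # Weighted average of (X_i - x) / t is the gradient of the log density.
    return np.einsum("mn,mnd->md", weights, diffs) / t


# Usage: compare with the closed-form score when p_0 is standard Gaussian.
rng = np.random.default_rng(0)
n, d, t = 5000, 2, 0.5
samples = rng.standard_normal((n, d))
queries = np.array([[0.5, -0.5], [0.0, 1.0]])
estimate = smoothed_empirical_score(queries, samples, t)
truth = -queries / (1.0 + t)
print("max abs error:", np.max(np.abs(estimate - truth)))
```

Rerunning the usage lines with larger `n` or smaller `t` gives an informal feel for how the error depends on both quantities, though a single run at a few query points is not a check of the stated rate.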