We study the problem of estimating the score function of an unknown probability distribution $\rho^*$ from $n$ independent and identically distributed observations in $d$ dimensions. Assuming that $\rho^*$ is subgaussian and has a Lipschitz-continuous score function $s^*$, we establish the optimal rate of $\tilde \Theta(n^{-\frac{2}{d+4}})$ for this estimation problem under the loss function $\|\hat s - s^*\|^2_{L^2(\rho^*)}$ commonly used in the score matching literature. This rate highlights the curse of dimensionality: the sample complexity of accurate score estimation grows exponentially with the dimension $d$. Leveraging key insights from empirical Bayes theory, together with a new convergence rate for the smoothed empirical distribution in Hellinger distance, we show that a regularized score estimator based on a Gaussian kernel attains this rate, which a matching minimax lower bound shows to be optimal. We also discuss the implications of our theory for the sample complexity of score-based generative models.
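
To make the notion of a regularized score estimator based on a Gaussian kernel concrete, the following is a minimal sketch rather than the exact estimator analyzed here. It assumes the estimator takes the form of the gradient of a Gaussian kernel density estimate divided by that estimate, with the denominator floored at a small constant. The function name `kde_score`, the bandwidth `h`, and the floor `eps` are hypothetical choices made for illustration; the abstract does not specify how they should be set.

```python
import numpy as np


def kde_score(x, samples, h, eps):
    """Sketch of a regularized Gaussian-kernel score estimate (illustrative only).

    x:       (m, d) array of query points
    samples: (n, d) array of i.i.d. observations
    h:       kernel bandwidth (assumed given; no tuning rule is implied)
    eps:     floor on the density estimate (assumed form of regularization)
    Returns an (m, d) array approximating the score at each query point.
    """
    x = np.atleast_2d(x)
    n, d = samples.shape
    diff = samples[None, :, :] - x[:, None, :]         # (m, n, d): X_i - x
    sq_dist = np.sum(diff ** 2, axis=-1)               # (m, n)
    norm = (2.0 * np.pi * h ** 2) ** (-d / 2.0)        # Gaussian normalizing constant
    w = norm * np.exp(-sq_dist / (2.0 * h ** 2))       # kernel values phi_h(x - X_i)
    p_hat = w.mean(axis=1)                             # (m,): kernel density estimate
    grad = (w[:, :, None] * diff).mean(axis=1) / h ** 2  # (m, d): gradient of p_hat
    # Flooring the denominator keeps the ratio bounded where p_hat is tiny.
    return grad / np.maximum(p_hat, eps)[:, None]


# Usage: for standard normal data the true score is s*(x) = -x.
rng = np.random.default_rng(0)
data = rng.standard_normal((20000, 1))
estimate = kde_score(np.array([[0.5]]), data, h=0.5, eps=1e-3)
```

In this toy setting the kernel-smoothed density is $N(0, 1 + h^2)$, so the estimate targets $-x/(1+h^2)$ rather than $-x$. The gap between the two illustrates the bias introduced by smoothing, which a bandwidth choice must balance against variance.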