Distribution learning via neural differential equations: a nonparametric statistical perspective

Ordinary differential equations (ODEs), via their induced flow maps, provide a powerful framework to parameterize invertible transformations for the purpose of representing complex probability distributions. While such models have achieved enormous success in machine learning, particularly for generative modeling and density estimation, little is known about their statistical properties. This work establishes the first general nonparametric statistical convergence analysis for distribution learning via ODE models trained through likelihood maximization. We first prove a convergence theorem applicable to arbitrary velocity field classes $\mathcal{F}$ satisfying certain simple boundary constraints. This general result captures the trade-off between approximation error (`bias') and the complexity of the ODE model (`variance'). We show that the latter can be quantified via the $C^1$-metric entropy of the class $\mathcal{F}$. We then apply this general framework to the setting of $C^k$-smooth target densities, and establish nearly minimax-optimal convergence rates for two relevant velocity field classes $\mathcal{F}$: $C^k$ functions and neural networks. The latter is the practically important case of neural ODEs. Our proof techniques require a careful synthesis of (i) analytical stability results for ODEs, (ii) classical theory for sieved M-estimators, and (iii) recent results on approximation rates and metric entropies of neural network classes. The results also provide theoretical insight into how the choice of velocity field class, and the dependence of this choice on the sample size $n$ (e.g., the scaling of width, depth, and sparsity of neural network classes), affect statistical performance.
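
For orientation, the setup described above can be written out in one common convention. This is a sketch only: the symbols $X^f$, $T^f$, $\rho$, $p_f$, and $\hat f_n$ are notation assumed here for illustration and need not match the paper's, and the direction of transport (target to reference) is likewise an assumption. The identity for the log-determinant is the standard instantaneous change-of-variables formula for ODE flows.

```latex
% Flow map induced by a velocity field f in the class \mathcal{F}
% (notation assumed for illustration).
\begin{align*}
  \frac{\mathrm{d}}{\mathrm{d}t} X^f(x,t) &= f\bigl(X^f(x,t),\,t\bigr),
  \qquad X^f(x,0) = x, \qquad T^f(x) := X^f(x,1), \\
% Model density when T^f transports samples toward a reference density \rho.
  \log p_f(x) &= \log \rho\bigl(T^f(x)\bigr)
  + \int_0^1 (\nabla\!\cdot f)\bigl(X^f(x,t),\,t\bigr)\,\mathrm{d}t, \\
% Maximum likelihood estimator over the class, given samples x_1,\dots,x_n.
  \hat f_n &\in \operatorname*{arg\,max}_{f \in \mathcal{F}}
  \frac{1}{n} \sum_{i=1}^{n} \log p_f(x_i).
\end{align*}
```

Under this convention the log-likelihood depends on $f$ through its divergence, that is, through first derivatives of the velocity field, which is consistent with model complexity being measured in a $C^1$-type metric entropy rather than a sup-norm one.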
