In this work, we study the theoretical properties of conditional deep generative models under the statistical framework of distribution regression, where the response variable lies in a high-dimensional ambient space but concentrates around a potentially lower-dimensional manifold. Specifically, we analyze the large-sample properties of a likelihood-based approach for estimating these models. Our results yield the convergence rate of a sieve maximum likelihood estimator (MLE) for the conditional distribution (and its deconvolved counterpart) of the response given the predictors, in the Hellinger (respectively, Wasserstein) metric. The rate depends solely on the intrinsic dimension and smoothness of the true conditional distribution. These findings offer a statistical explanation of why conditional deep generative models can circumvent the curse of dimensionality, and show that they can learn a broad class of nearly singular conditional distributions. Our analysis also highlights the importance of introducing a small noise perturbation to the data when they are supported sufficiently close to a manifold. Finally, our numerical studies demonstrate an effective implementation of the proposed approach on both synthetic and real-world datasets, providing complementary validation of our theoretical findings.
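The noise-perturbation step mentioned above can be illustrated with a minimal sketch (not the paper's implementation): responses concentrated on a low-dimensional manifold have a nearly singular conditional distribution, so a small Gaussian perturbation is added before likelihood-based fitting. The manifold (a unit circle in 2-D), the predictor, and the perturbation scale `sigma` here are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic responses lying exactly on a 1-D manifold (the unit circle)
# embedded in 2-D ambient space, conditioned on a scalar predictor x.
n = 500
x = rng.uniform(0.0, 1.0, size=n)
theta = 2.0 * np.pi * x
y = np.stack([np.cos(theta), np.sin(theta)], axis=1)

# Small noise perturbation: smooths the singular conditional distribution
# so that a density (and hence a likelihood) exists in the ambient space.
sigma = 0.05  # assumed perturbation scale, a tuning parameter
y_perturbed = y + sigma * rng.normal(size=y.shape)

# The perturbed responses remain concentrated near the manifold.
dist_to_manifold = np.abs(np.linalg.norm(y_perturbed, axis=1) - 1.0)
print(float(np.mean(dist_to_manifold)))
```

A likelihood-based estimator such as a sieve MLE would then be fit to `(x, y_perturbed)`; the perturbation scale plays the role of the smoothing level in the theory, and the unperturbed (deconvolved) conditional distribution is recovered in the Wasserstein metric.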