Generative modeling is a widely used machine learning method with applications across scientific and industrial fields. Its primary objective is to simulate new examples from an unknown distribution given training data, while ensuring diversity and avoiding replication of the training examples. This paper presents theoretical insights into training a generative model with two properties: (i) the error incurred by replacing the true data-generating distribution with the trained one should converge to zero at the optimal rate as the sample size approaches infinity, and (ii) the trained data-generating distribution should be sufficiently far from any distribution that replicates examples in the training data. We provide non-asymptotic results in the form of finite-sample risk bounds that quantify these properties and depend on relevant parameters such as the sample size, the dimension of the ambient space, and the dimension of the latent space. Our results apply to general integral probability metrics used to quantify errors in spaces of probability distributions, with the Wasserstein-$1$ distance as the central example. We also include numerical examples to illustrate our theoretical findings.
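As a purely illustrative sketch (not the paper's method or experiments), the snippet below computes the Wasserstein-$1$ distance between two equal-size empirical distributions on the real line, where it reduces to the mean absolute difference of the sorted samples. The function name `wasserstein1_1d` and the toy data are assumptions introduced for this example. It shows why property (ii) matters: a "generator" that merely replays the training set sits at distance zero from the empirical distribution of the training data, even though it produces nothing new.

```python
def wasserstein1_1d(xs, ys):
    """Wasserstein-1 distance between two empirical distributions on the
    real line. This sketch assumes both samples have the same size, in
    which case the optimal coupling matches sorted values one to one."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    return sum(abs(x - y) for x, y in zip(xs_sorted, ys_sorted)) / len(xs)


# Hypothetical toy data, chosen only for illustration.
train = [0.0, 1.0, 2.0, 3.0]
replica = list(reversed(train))        # replays the training examples
shifted = [x + 0.5 for x in train]     # new points, offset by 0.5

print(wasserstein1_1d(train, replica))  # 0.0
print(wasserstein1_1d(train, shifted))  # 0.5
```

The one-dimensional, equal-size case is used only because it has a closed form; in higher dimensions the distance generally requires solving an optimal transport problem.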