The success of large generative models has driven a paradigm shift: leveraging massive multi-source data to enhance model capabilities. However, the interaction among these sources remains theoretically underexplored. This paper takes the first step toward a rigorous analysis of multi-source training in conditional generative modeling, where each condition represents a distinct data source. Specifically, we establish a general distribution estimation error bound in average total variation distance for conditional maximum likelihood estimation, based on the bracketing number. Our result shows that when the source distributions share certain similarities and the model is expressive enough, multi-source training guarantees a sharper bound than single-source training. We further instantiate the general theory for conditional Gaussian estimation and for deep generative models, including autoregressive and flexible energy-based models, by characterizing their bracketing numbers. The results highlight that the advantage of multi-source training grows with the number of sources and the similarity among the source distributions. Simulations and real-world experiments validate our theory. Code is available at: \url{https://github.com/ML-GSAI/Multi-Source-GM}.
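For intuition, bracketing-based bounds for maximum likelihood estimation typically take the following shape; this is a generic sketch under assumed notation, not the paper's exact theorem. Here $K$ denotes the number of sources, $n$ the per-source sample size, $\mathcal{P}$ the conditional model class, $p^{*}(\cdot \mid k)$ and $\hat{p}(\cdot \mid k)$ the true and estimated distributions of source $k$, and $\mathcal{N}_{[\,]}(\delta, \mathcal{P})$ the $\delta$-bracketing number of $\mathcal{P}$:
\[
\mathbb{E}\Bigg[\frac{1}{K}\sum_{k=1}^{K}\mathrm{TV}\big(\hat{p}(\cdot \mid k),\, p^{*}(\cdot \mid k)\big)\Bigg] \;\lesssim\; \sqrt{\frac{\log \mathcal{N}_{[\,]}\big(1/(nK),\, \mathcal{P}\big)}{nK}}.
\]
In this shape, pooling the $K$ sources can improve the per-source rate whenever similarity among the source distributions keeps the bracketing number of the joint class from growing too quickly in $K$.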