Generative AI has achieved remarkable empirical success, but from the perspective of statistics it often remains opaque: its predictions may be accurate, yet the underlying mechanism is difficult to interpret, analyze, and trust. This book reinterprets generative AI in the language of statistics, using flow matching as a central example. The key idea is that generative models should be understood not merely as devices for producing plausible data, but as methods for the nonparametric learning of high-dimensional probability distributions. From this viewpoint, missing-data imputation becomes principled sampling from learned conditional distributions, counterfactual analysis becomes the estimation of intervention distributions, and distributional dynamics become statistically analyzable objects. Mathematically, flow matching represents distributional deformation through the continuity equation and a time-dependent velocity field, thereby extending score matching from the learning of static score fields to the learning of transport paths themselves. Building on this foundation, the book develops a statistical framework in which generative models are used to estimate nuisance components while inferential validity is maintained through orthogonalization and cross-fitting in the spirit of double/debiased machine learning. Applications to survival analysis, censoring, missingness, and causal inference show how generative models can be integrated into statistical inference for structured high-dimensional problems.
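To make the mathematical claim concrete, the objects named above can be written out in standard notation. This is a sketch, not the book's own formulation: the symbols $p_t$, $v_t$, $v_\theta$ and the linear interpolation path are assumptions chosen here because they are a common way to present flow matching.

\[
\partial_t p_t(x) + \nabla \cdot \bigl( p_t(x)\, v_t(x) \bigr) = 0,
\qquad
\frac{d X_t}{dt} = v_t(X_t), \quad X_0 \sim p_0 ,
\]
\[
\mathcal{L}(\theta) = \mathbb{E}_{t \sim \mathrm{Unif}[0,1],\; X_0 \sim p_0,\; X_1 \sim p_1}
\bigl\| v_\theta(t, X_t) - (X_1 - X_0) \bigr\|^2 ,
\qquad X_t = (1 - t)\, X_0 + t\, X_1 .
\]

The first line is the continuity equation together with the ODE whose solutions carry samples from $p_0$ along the density path $p_t$. The second is one widely used regression objective for fitting the velocity field, under the assumed linear path between a reference sample $X_0$ and a data sample $X_1$. Score matching, by contrast, targets the static field $\nabla_x \log p(x)$ of a single distribution.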