Deep generative models often face a fundamental tradeoff: high sample quality can come at the cost of memorisation, where the model reproduces training data rather than generalising across the underlying data geometry. We introduce Carré du champ flow matching (CDC-FM), a generalisation of flow matching (FM) that improves the quality-generalisation tradeoff by regularising the probability path with geometry-aware noise. Our method replaces the homogeneous, isotropic noise in FM with spatially varying, anisotropic Gaussian noise whose covariance captures the local geometry of the latent data manifold. We prove that this geometric noise can be optimally estimated from the data and is scalable to large datasets. Further, we provide an extensive experimental evaluation on diverse datasets (synthetic manifolds, point clouds, single-cell genomics, animal motion capture, and images) and on various neural network architectures (MLPs, CNNs, and transformers). We demonstrate that CDC-FM consistently offers a better quality-generalisation tradeoff than standard FM, with significant improvements in data-scarce regimes and in highly non-uniformly sampled datasets, both of which are often encountered in AI-for-science applications. Our work provides a mathematical framework for studying the interplay between data geometry, generalisation, and memorisation in generative models, as well as a robust and scalable algorithm that can be readily integrated into existing flow matching pipelines.
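To make the idea of spatially varying, anisotropic noise concrete, the following is a minimal NumPy sketch, not the estimator used in CDC-FM. It assumes that a local covariance can be approximated from each point's k nearest neighbours, which the abstract does not specify. The function names `local_covariances` and `sample_anisotropic_noise`, and the parameters `k` and `eps`, are hypothetical and introduced only for illustration.

```python
import numpy as np


def local_covariances(x, k=8, eps=1e-6):
    """Per-point covariance estimated from the k nearest neighbours.

    Assumption: a kNN sample covariance stands in for the geometry-aware
    covariance; the abstract does not state how it is estimated.
    """
    n, d = x.shape
    # Pairwise squared Euclidean distances, shape (n, n).
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    # Indices of the k nearest neighbours, skipping the point itself.
    idx = np.argsort(d2, axis=1)[:, 1:k + 1]
    nbrs = x[idx]  # shape (n, k, d)
    centred = nbrs - nbrs.mean(axis=1, keepdims=True)
    cov = np.einsum("nki,nkj->nij", centred, centred) / (k - 1)
    # A small isotropic term keeps every matrix positive definite.
    return cov + eps * np.eye(d)


def sample_anisotropic_noise(cov, rng):
    """Draw one zero-mean Gaussian sample per point with covariance cov[i]."""
    chol = np.linalg.cholesky(cov)  # batched Cholesky factors, shape (n, d, d)
    z = rng.standard_normal(cov.shape[:2])
    return np.einsum("nij,nj->ni", chol, z)


# Toy data: points on the unit circle, a 1-D manifold embedded in 2-D.
theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
x = np.stack([np.cos(theta), np.sin(theta)], axis=1)

cov = local_covariances(x, k=8)
noise = sample_anisotropic_noise(cov, np.random.default_rng(0))
x_noisy = x + noise  # perturbations lie mostly along the circle's tangent
```

On this toy circle, the dominant eigenvector of each estimated covariance points along the tangent, so the noise spreads points along the manifold rather than off it. Isotropic noise of the same scale would perturb all directions equally.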