Symmetry is ubiquitous in real-world phenomena and tasks, such as in physics, images, and molecular simulations. Empirical studies have demonstrated that incorporating symmetries into generative models can provide better generalization and sampling efficiency when the underlying data distribution has group symmetry. In this work, we provide the first theoretical analysis and guarantees of score-based generative models (SGMs) for learning distributions that are invariant under a group symmetry, and we offer the first quantitative comparison between data augmentation and adding an equivariant inductive bias. First, building on recent works on the Wasserstein-1 ($\mathbf{d}_1$) guarantees of SGMs and on the empirical estimation of probability divergences under group symmetry, we provide an improved $\mathbf{d}_1$ generalization bound when the data distribution is group-invariant. Second, we describe the inductive bias of equivariant SGMs using Hamilton-Jacobi-Bellman theory and rigorously demonstrate, by analyzing the optimality and equivalence of score-matching objectives, that one can learn the score of a symmetrized distribution using equivariant vector fields without data augmentation. This also yields practical guidance: one does not have to augment the dataset as long as the vector field (or the neural-network parametrization) is equivariant. Moreover, we quantify the impact of not incorporating equivariant structure into the score parametrization by showing that non-equivariant vector fields can yield worse generalization bounds. This can be viewed as a type of model-form error that describes the missing structure of non-equivariant vector fields. Numerical simulations corroborate our analysis and highlight that data augmentation cannot replace the role of equivariant vector fields.
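To make the structural fact underlying these claims concrete, the following is a minimal worked illustration in our own notation (the assumption of a compact group acting orthogonally is ours, chosen for simplicity). If a compact group $G$ acts orthogonally on $\mathbb{R}^d$ and the data density is $G$-invariant, $p(gx) = p(x)$ for all $g \in G$, then the score is $G$-equivariant:
$$
\nabla \log p(gx) \;=\; g \, \nabla \log p(x),
$$
since $\nabla_x \big[ \log p(gx) \big] = g^{\top} (\nabla \log p)(gx) = \nabla \log p(x)$ and $g^{\top} = g^{-1}$. Conversely, group-averaging any vector field $s$,
$$
(S_G s)(x) \;=\; \int_G g^{-1} s(gx) \, d\mu(g),
$$
with $\mu$ the Haar probability measure on $G$, produces an equivariant field: $(S_G s)(hx) = h \,(S_G s)(x)$ by the change of variables $g \mapsto gh$ and the invariance of $\mu$. This is the sense in which an equivariant parametrization can match the score of a symmetrized distribution directly, without augmenting the data.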