Symmetry-aware methods for machine learning, such as data augmentation and equivariant architectures, encourage correct model behavior on all transformations (e.g. rotations or permutations) of the original dataset. These methods can improve generalization and sample efficiency, under the assumption that the transformed datapoints are highly probable, or "important", under the test distribution. In this work, we develop a method for critically evaluating this assumption. In particular, we propose a metric that quantifies the amount of symmetry breaking in a dataset via a two-sample classifier test, which distinguishes the original dataset from its randomly augmented counterpart. We validate our metric on synthetic datasets, and then use it to uncover surprisingly high degrees of symmetry breaking in several benchmark point cloud datasets, which constitutes a severe form of dataset bias. We show theoretically, for invariant ridge regression in the infinite-feature limit, that distributional symmetry breaking can prevent invariant methods from performing optimally even when the underlying labels are truly invariant. Empirically, the implication for symmetry-aware methods is dataset-dependent: equivariant methods still impart benefits on some symmetry-biased datasets but not on others, particularly when the symmetry bias is predictive of the labels. Overall, these findings suggest that understanding equivariance, both when it works and why, may require rethinking symmetry biases in the data.
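The two-sample classifier test described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the 2D rotation group, the logistic-regression classifier, the unit-circle data, and every function name (`random_rotation`, `train_logistic`, `symmetry_breaking_score`, `circle_points`) are assumptions chosen for illustration. The idea is to label original points 0 and randomly rotated points 1, train a classifier to tell them apart, and read its held-out accuracy as the score: accuracy near 0.5 suggests the data already look rotation-symmetric, while higher accuracy suggests a preferred orientation.

```python
import math
import random


def random_rotation(point, rng):
    """Rotate a 2D point by an angle drawn uniformly from [0, 2*pi)."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    c, s = math.cos(theta), math.sin(theta)
    x, y = point
    return (c * x - s * y, s * x + c * y)


def sigmoid(z):
    # Clamp the logit so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(samples, epochs=200, lr=0.5):
    """Fit a 2D logistic regression by full-batch gradient descent.

    samples is a list of ((x, y), label) pairs with label in {0, 1}.
    """
    w0, w1, b = 0.0, 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        g0 = g1 = gb = 0.0
        for (x, y), label in samples:
            err = sigmoid(w0 * x + w1 * y + b) - label
            g0 += err * x
            g1 += err * y
            gb += err
        w0 -= lr * g0 / n
        w1 -= lr * g1 / n
        b -= lr * gb / n
    return w0, w1, b


def symmetry_breaking_score(points, seed=0):
    """Held-out accuracy of a classifier separating original from augmented points.

    About 0.5 means the two samples look alike (little symmetry breaking);
    values approaching 1.0 mean the dataset has a preferred orientation.
    """
    rng = random.Random(seed)
    pts = list(points)
    rng.shuffle(pts)
    half = len(pts) // 2
    # Disjoint halves: label 0 keeps the original pose, label 1 is randomly rotated.
    labeled = [(p, 0) for p in pts[:half]]
    labeled += [(random_rotation(p, rng), 1) for p in pts[half:]]
    rng.shuffle(labeled)
    split = len(labeled) // 2
    train, test = labeled[:split], labeled[split:]
    w0, w1, b = train_logistic(train)
    correct = sum(
        int((w0 * x + w1 * y + b > 0.0) == (label == 1))
        for (x, y), label in test
    )
    return correct / len(test)


def circle_points(n, angle_std, seed):
    """Toy data on the unit circle; angle_std=None gives uniform angles."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        if angle_std is None:
            a = rng.uniform(0.0, 2.0 * math.pi)
        else:
            a = rng.gauss(0.0, angle_std)
        out.append((math.cos(a), math.sin(a)))
    return out


if __name__ == "__main__":
    print("canonically oriented:", symmetry_breaking_score(circle_points(2000, 0.3, seed=1)))
    print("rotation-symmetric:  ", symmetry_breaking_score(circle_points(2000, None, seed=2)))
```

In this sketch the original and augmented samples are drawn from disjoint halves of the data, so the classifier never sees both the original and the rotated copy of the same point. A linear classifier is enough here only because the toy bias shifts the mean of the points; point cloud datasets like those the abstract mentions would presumably need a more expressive classifier.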