Recent advances in self-supervised learning have highlighted the efficacy of data augmentation in learning data representations from unlabeled data. Training a linear model on top of these representations can yield an accurate classifier. Despite this remarkable empirical performance, the mechanisms that enable data augmentation to unravel nonlinear data structures into linearly separable representations remain elusive. This paper seeks to bridge this gap by investigating the conditions under which learned representations can linearly separate manifolds when data are drawn from a multi-manifold model. Our investigation reveals that data augmentation offers additional information beyond the observed data and can thus improve the information-theoretic optimal rate of linear separation capacity. In particular, we show that self-supervised learning can linearly separate manifolds separated by a smaller distance than unsupervised learning can, underscoring the additional benefits of data augmentation. Our theoretical analysis further shows that the performance of downstream linear classifiers hinges primarily on the linear separability of data representations rather than on the size of the labeled data set, reaffirming the viability of constructing efficient classifiers from limited labeled data given an expansive unlabeled data set.