Data augmentation is a widely used technique and an essential ingredient in recent advances in self-supervised representation learning. By preserving the similarity between augmented data, the resulting data representation can improve various downstream analyses and achieve state-of-the-art performance in many applications. Despite their empirical effectiveness, most existing methods lack theoretical understanding in a general nonlinear setting. To fill this gap, we develop a statistical framework on a low-dimensional product manifold to model the data augmentation transformation. Under this framework, we introduce a new representation learning method called augmentation invariant manifold learning and design a computationally efficient algorithm by reformulating it as a stochastic optimization problem. Compared with existing self-supervised methods, the new method simultaneously exploits the manifold's geometric structure and the invariance property of augmented data, and it has an explicit theoretical guarantee. Our theoretical investigation characterizes the role of data augmentation in the proposed method and reveals why and how the data representation learned from augmented data can improve the $k$-nearest neighbor classifier in the downstream analysis, showing that more complex data augmentation leads to greater improvement. Finally, numerical experiments on simulated and real datasets demonstrate the merits of the proposed method.