Data, at once the central opportunity and challenge of modern machine learning, currently constrains the scalability of representation learning and slows the pace of model evolution. In this work, we investigate the efficiency properties of data from both optimization and generalization perspectives. Our theoretical and empirical analysis reveals an unexpected finding: for a given task, a publicly available, task- and architecture-agnostic model (referred to as the `prior model' in this paper) can be used to effectively produce efficient data. Building on this insight, we propose the Representation Learning Accelerator (\algopt), which promotes the formation and utilization of efficient data, thereby accelerating representation learning. For example, using a ResNet-18 pre-trained on CIFAR-10 as a prior model to guide ResNet-50 training on ImageNet-1K reduces computational cost by 50% while matching the accuracy of a model trained with the original BYOL at full (100%) cost. Our code is available at: \url{https://github.com/LINs-lab/ReLA}.