EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones

The superior performance of modern deep networks usually comes with a costly training procedure. This paper presents a new curriculum learning approach for the efficient training of visual backbones (e.g., vision Transformers). Our work is inspired by the inherent learning dynamics of deep networks: we experimentally show that at an earlier training stage, the model mainly learns to recognize some 'easier-to-learn' discriminative patterns within each example, e.g., the lower-frequency components of images and the original information before data augmentation. Driven by this phenomenon, we propose a curriculum in which the model always leverages all the training data at each epoch, but starts by exposing only the 'easier-to-learn' patterns of each example and gradually introduces more difficult patterns as training progresses. To implement this idea, we (1) introduce a cropping operation in the Fourier spectrum of the inputs, which enables the model to learn efficiently from only the lower-frequency components, (2) demonstrate that exposing the features of original images amounts to adopting weaker data augmentation, and (3) integrate (1) and (2) to design a curriculum learning schedule with a greedy-search algorithm. The resulting approach, EfficientTrain, is simple, general, yet surprisingly effective. As an off-the-shelf method, it reduces the wall-time training cost of a wide variety of popular models (e.g., ResNet, ConvNeXt, DeiT, PVT, Swin, and CSWin) by more than 1.5× on ImageNet-1K/22K without sacrificing accuracy. It is also effective for self-supervised learning (e.g., MAE). Code is available at https://github.com/LeapLabTHU/EfficientTrain.
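The cropping operation in the Fourier spectrum can be illustrated with a minimal NumPy sketch. It is not taken from the EfficientTrain repository: the function names `low_freq_crop` and `bandwidth_for_epoch`, the equal-length stages, and the bandwidth values are all assumptions chosen for illustration, whereas the schedule described in the abstract is found by greedy search. The sketch keeps only the central (lowest-frequency) window of the shifted spectrum and inverts it, which yields a smaller image and therefore a cheaper forward and backward pass.

```python
import numpy as np


def low_freq_crop(image: np.ndarray, bandwidth: int) -> np.ndarray:
    """Keep only the lowest `bandwidth` x `bandwidth` frequencies of an image.

    `image` has shape (H, W) or (H, W, C). The result has shape
    (bandwidth, bandwidth) or (bandwidth, bandwidth, C).
    Hypothetical helper for illustration, not the authors' implementation.
    """
    h, w = image.shape[:2]
    if not 0 < bandwidth <= min(h, w):
        raise ValueError("bandwidth must be in the range 1..min(H, W)")

    # 2-D FFT over the spatial axes, with the zero frequency moved to the center.
    spectrum = np.fft.fftshift(np.fft.fft2(image, axes=(0, 1)), axes=(0, 1))

    # Crop a window centered on the zero-frequency component.
    top = h // 2 - bandwidth // 2
    left = w // 2 - bandwidth // 2
    cropped = spectrum[top:top + bandwidth, left:left + bandwidth]

    # Undo the shift and invert; the output is a smaller spatial image.
    small = np.fft.ifft2(np.fft.ifftshift(cropped, axes=(0, 1)), axes=(0, 1))

    # Rescale so the mean intensity of the original image is preserved.
    scale = (bandwidth * bandwidth) / (h * w)
    return small.real * scale


def bandwidth_for_epoch(epoch: int, total_epochs: int,
                        bandwidths=(96, 160, 224)) -> int:
    """Illustrative staged schedule: equal-length stages, growing bandwidth.

    The stage boundaries and bandwidth values are assumptions, not the
    schedule reported for EfficientTrain.
    """
    stage = min(epoch * len(bandwidths) // total_epochs, len(bandwidths) - 1)
    return bandwidths[stage]
```

In a training loop, one would call `bandwidth_for_epoch` once per epoch and apply `low_freq_crop` to every input, so that each epoch still visits all training examples while early epochs see only their low-frequency content. Pairing this with an augmentation strength that grows over training would correspond to the second ingredient described above.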
