Masked Autoencoder (MAE) pre-training of vision transformers (ViTs) yields strong performance in low-label data regimes but comes with substantial computational costs, making it impractical in time- and resource-constrained industrial settings. We address this by integrating Decorrelated Backpropagation (DBP), an optimization method that iteratively reduces input correlations at each layer to accelerate convergence, into MAE pre-training. Applied selectively to the encoder, DBP achieves faster pre-training without loss of stability. To mimic constrained-data scenarios, we evaluate our approach on ImageNet-1K pre-training and ADE20K fine-tuning using randomly sampled subsets of each dataset. Under this setting, DBP-MAE reduces wall-clock time to baseline performance by 21.1%, lowers carbon emissions by 21.4%, and improves segmentation mIoU by 1.1 points. We observe similar gains when pre-training and fine-tuning on proprietary industrial data, confirming the method's applicability in real-world scenarios. These results demonstrate that DBP can reduce training time and energy use while improving downstream performance in large-scale ViT pre-training.

Keywords: Deep learning, Vision transformers, Efficient AI, Decorrelation
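To make the decorrelation idea concrete, the following is a minimal, hypothetical PyTorch sketch of a per-layer decorrelating transform whose matrix is updated outside of backpropagation to suppress off-diagonal correlations in its outputs. The class name, update rule, and learning rate are illustrative assumptions, not the exact DBP procedure or hyperparameters used in this work.

```python
import torch


class DecorrelationLayer(torch.nn.Module):
    """Illustrative decorrelating transform applied to a layer's inputs.

    Maintains a square matrix R, applied as x_hat = R x, and updates R with a
    gradient-free rule that drives the off-diagonal covariance of x_hat toward
    zero. This is a sketch of the general decorrelation idea, not the paper's
    exact method.
    """

    def __init__(self, dim: int, lr: float = 1e-3):
        super().__init__()
        # R starts as the identity, so the transform is initially a no-op.
        self.register_buffer("R", torch.eye(dim))
        self.lr = lr  # assumed decorrelation step size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim). Row-wise application of R: x_hat_i = R x_i.
        x_hat = x @ self.R.T
        if self.training:
            with torch.no_grad():
                # Empirical covariance of the decorrelated outputs.
                cov = (x_hat.T @ x_hat) / x_hat.shape[0]
                # Keep only the off-diagonal (correlation) part.
                off_diag = cov - torch.diag(torch.diagonal(cov))
                # Update R to shrink those correlations on the next batch.
                self.R -= self.lr * off_diag @ self.R
        return x_hat
```

In a DBP-style setup, such a transform would sit in front of a layer's trainable weights so that weight gradients are computed on (approximately) decorrelated inputs, which is the mechanism the abstract credits for faster convergence.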