Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding. Scale is a primary factor influencing the performance of these foundation models. However, large foundation models incur high computational cost, which can limit their deployment. This paper focuses on pre-training relatively small vision transformer models that can be efficiently adapted to downstream tasks. Specifically, taking inspiration from knowledge distillation in model compression, we propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding. The core of AMD is an asymmetric masking strategy: the teacher model sees more context through a lower masking ratio, while the student model keeps the high masking ratio of the original masked pre-training. We design a customized multi-layer feature alignment between the teacher encoder and the student encoder to regularize the pre-training of the student MAE. To demonstrate the effectiveness and versatility of AMD, we apply it to both ImageMAE and VideoMAE for pre-training relatively small ViT models. AMD achieves 84.6% classification accuracy on IN1K using the ViT-B model. On the Something-Something V2 dataset, AMD achieves 73.3% classification accuracy using the ViT-B model, a 3.7% improvement over the original ViT-B model from VideoMAE. We also transfer AMD pre-trained models to downstream tasks and obtain consistent performance improvements over standard pre-training.
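The asymmetric masking and multi-layer feature alignment described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `asymmetric_masks` and `alignment_loss`, the default ratios (0.75 for the student, 0.5 for the teacher), the use of a mean-squared error, and the choice to make the student's visible patches a subset of the teacher's are all assumptions made for illustration. The sketch also assumes teacher and student features already share the same width; models of different widths would need a projection before comparison.

```python
import numpy as np


def asymmetric_masks(num_patches, student_ratio=0.75, teacher_ratio=0.5, seed=0):
    """Return boolean masks (True = masked) for the student and the teacher.

    Assumption: the teacher masks fewer patches than the student, and every
    patch visible to the student is also visible to the teacher.
    """
    if not 0.0 <= teacher_ratio <= student_ratio < 1.0:
        raise ValueError("expected 0 <= teacher_ratio <= student_ratio < 1")
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_patches)  # one shared random patch order
    n_student_visible = num_patches - int(round(num_patches * student_ratio))
    n_teacher_visible = num_patches - int(round(num_patches * teacher_ratio))
    student_mask = np.ones(num_patches, dtype=bool)
    teacher_mask = np.ones(num_patches, dtype=bool)
    # The first entries of the shared order stay visible; the teacher keeps more.
    student_mask[order[:n_student_visible]] = False
    teacher_mask[order[:n_teacher_visible]] = False
    return student_mask, teacher_mask


def alignment_loss(student_feats, teacher_feats, student_mask, teacher_mask):
    """Mean-squared error over layers, on patches visible to the student.

    student_feats / teacher_feats: lists (one entry per aligned layer) of
    arrays shaped (num_visible_patches, dim), rows sorted by patch index.
    """
    student_idx = np.flatnonzero(~student_mask)
    teacher_idx = np.flatnonzero(~teacher_mask)
    # Row of each student-visible patch inside the teacher's visible sequence.
    rows = np.searchsorted(teacher_idx, student_idx)
    per_layer = [np.mean((s - t[rows]) ** 2)
                 for s, t in zip(student_feats, teacher_feats)]
    return float(np.mean(per_layer))


if __name__ == "__main__":
    s_mask, t_mask = asymmetric_masks(num_patches=196)
    rng = np.random.default_rng(1)
    # Hypothetical encoder outputs for two aligned layers, width 8.
    t_feats = [rng.normal(size=(int((~t_mask).sum()), 8)) for _ in range(2)]
    s_feats = [rng.normal(size=(int((~s_mask).sum()), 8)) for _ in range(2)]
    print("student visible:", int((~s_mask).sum()))
    print("teacher visible:", int((~t_mask).sum()))
    print("alignment loss:", alignment_loss(s_feats, t_feats, s_mask, t_mask))
```

In a training loop, a term like this would be added to the student's reconstruction loss; how the two are weighted is not specified here and would be a further design choice.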