Asymmetric Masked Distillation for Pre-Training Small Foundation Models

Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding. Scale is a primary factor influencing the performance of these foundation models. However, large foundation models incur high computational cost, which can limit their deployment. This paper focuses on pre-training relatively small vision transformer models that can be efficiently adapted to downstream tasks. Specifically, taking inspiration from knowledge distillation in model compression, we propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding. The core of AMD is an asymmetric masking strategy: the teacher model sees more context information through a lower masking ratio, while the student model keeps the high masking ratio of the original masked pre-training. We design customized multi-layer feature alignment between the teacher encoder and the student encoder to regularize the pre-training of the student MAE. To demonstrate the effectiveness and versatility of AMD, we apply it to both ImageMAE and VideoMAE for pre-training relatively small ViT models. AMD achieves 84.6% classification accuracy on IN1K using the ViT-B model. On the Something-Something V2 dataset, AMD achieves 73.3% classification accuracy using the ViT-B model, a 3.7% improvement over the original ViT-B model from VideoMAE. We also transfer AMD pre-trained models to downstream tasks and obtain consistent performance improvements over standard pre-training.
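
The asymmetric masking and feature alignment described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions the abstract does not state: the mask ratios (0.75 for the student, 0.5 for the teacher), the choice to make the student's visible tokens a subset of the teacher's, the mean-squared-error alignment loss, and the linear projection `proj`. The function names `asymmetric_masks` and `alignment_loss` are hypothetical.

```python
import numpy as np


def asymmetric_masks(num_tokens, student_mask_ratio, teacher_mask_ratio, rng):
    """Return sorted visible-token indices for the student and the teacher.

    Assumption: both masks come from one random permutation, so every token
    the student sees is also seen by the teacher.
    """
    if not 0.0 <= teacher_mask_ratio < student_mask_ratio < 1.0:
        raise ValueError("expected 0 <= teacher ratio < student ratio < 1")
    order = rng.permutation(num_tokens)
    n_student = int(round(num_tokens * (1.0 - student_mask_ratio)))
    n_teacher = int(round(num_tokens * (1.0 - teacher_mask_ratio)))
    student_visible = np.sort(order[:n_student])
    teacher_visible = np.sort(order[:n_teacher])
    return student_visible, teacher_visible


def alignment_loss(student_feats, teacher_feats, student_visible, teacher_visible, proj):
    """MSE between projected student features and teacher features.

    Only tokens visible to both models are compared. `proj` maps the student
    feature width to the teacher feature width (an assumed design choice).
    """
    # Locate each student-visible token inside the teacher's sorted visible list.
    pos = np.searchsorted(teacher_visible, student_visible)
    target = teacher_feats[pos]
    pred = student_feats @ proj
    return float(np.mean((pred - target) ** 2))


rng = np.random.default_rng(0)
s_vis, t_vis = asymmetric_masks(196, 0.75, 0.5, rng)  # 14 x 14 patches, illustrative ratios

# Stand-in features for one aligned layer; real features would come from the encoders.
student_feats = rng.normal(size=(len(s_vis), 384))
teacher_feats = rng.normal(size=(len(t_vis), 768))
proj = rng.normal(size=(384, 768)) * 0.01
loss = alignment_loss(student_feats, teacher_feats, s_vis, t_vis, proj)
```

In a multi-layer setup, a loss of this kind would be computed for each chosen pair of student and teacher layers and added to the student's reconstruction loss. Which layers are paired and how the terms are weighted are not specified in the abstract.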
