Numerous self-supervised learning paradigms, such as contrastive learning and masked image modeling, have been proposed to learn powerful, general representations from unlabeled data. However, each model is typically pretrained within a single framework, ignoring the complementary nature of visual representations. To address this, we introduce Comprehensive Distillation with Multiple Self-supervised Teachers (DMT) for pretrained model compression, which leverages the strengths of multiple off-the-shelf self-supervised models. Experiments on prominent benchmark datasets show that the proposed method significantly surpasses state-of-the-art competitors while retaining favorable efficiency metrics. On classification tasks, DMT with three different self-supervised ViT-Base teachers improves the performance of both small/tiny models and the base model itself. On dense tasks, DMT raises the AP/mIoU of standard SSL models on the MS-COCO and ADE20K datasets by 4.0%.
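The abstract does not spell out the distillation objective, so the following is only a minimal sketch of the general idea of distilling one student from several frozen teachers. It assumes a weighted average of per-teacher mean-squared-error losses, with one hypothetical student head per teacher; the function names, the equal default weights, and the toy feature vectors are illustrative assumptions, not the DMT formulation.

```python
from typing import Optional, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_teacher_distill_loss(
    student_feats: Sequence[Sequence[float]],
    teacher_feats: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted average of per-teacher MSE losses (assumed objective).

    student_feats[i] is the student's output from a hypothetical head
    dedicated to teacher i; teacher_feats[i] is that frozen teacher's target.
    """
    n = len(teacher_feats)
    if len(student_feats) != n:
        raise ValueError("need one student head output per teacher")
    if weights is None:
        weights = [1.0 / n] * n  # assumption: all teachers weighted equally
    if len(weights) != n:
        raise ValueError("need one weight per teacher")
    return sum(
        w * mse(s, t) for w, s, t in zip(weights, student_feats, teacher_feats)
    )


# Toy features standing in for three different self-supervised teachers.
teachers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
loss = multi_teacher_distill_loss(student, teachers)
print(f"distillation loss: {loss:.4f}")
```

In a real training loop the vectors would be token or image features from the student and the frozen teachers, and the per-teacher weights are a design choice: equal weights treat the teachers' representations as equally informative, while tuned weights could favor the teacher best suited to the downstream task.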