Img2Vec: A Teacher of High Token-Diversity Helps Masked AutoEncoders

We present Image to Vector (Img2Vec), a pipeline for masked image modeling (MIM) with deep features. To study which type of deep features is appropriate as a learning target for MIM, we propose a simple MIM framework that uses a series of well-trained self-supervised models to convert an image into a feature vector, which then serves as the learning target; the feature extractor is also known as the teacher model. Surprisingly, we find empirically that an MIM model benefits more from image features generated by some lighter models (e.g., ResNet-50, 26M parameters) than from those generated by a cumbersome Transformer-based teacher (e.g., ViT-Large, 307M parameters). To analyze this phenomenon, we devise a novel attribute, token diversity, to characterize the features generated by different models. Token diversity measures the feature dissimilarity among different tokens. Through extensive experiments and visualizations, we hypothesize that, beyond the accepted view that a large model can improve MIM, high token diversity in the teacher model is also crucial. Based on this analysis, Img2Vec adopts a teacher model with high token diversity to generate image features. Img2Vec pre-trained with ViT-B on unlabeled ImageNet data reaches 85.1% top-1 accuracy after fine-tuning. Moreover, we scale Img2Vec up to larger models, ViT-L and ViT-H, and obtain 86.7% and 87.5% accuracy, respectively. It also achieves state-of-the-art results on other downstream tasks, e.g., 51.8% mAP on COCO and 50.7% mIoU on ADE20K. Img2Vec is a simple yet effective framework tailored to deep-feature MIM learning, with strong overall performance on representative vision tasks.
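
The abstract describes token diversity only as the feature dissimilarity among different tokens and gives no formula. As a minimal sketch of one way such a quantity could be computed, the Python below takes the mean pairwise cosine dissimilarity over the token features of a single image. The choice of metric, the function names `cosine_similarity` and `token_diversity`, and the toy feature vectors are all assumptions made for illustration; they are not the paper's definition or data.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def token_diversity(tokens):
    """Mean of (1 - cosine similarity) over all distinct token pairs.

    ASSUMPTION: an illustrative stand-in for "feature dissimilarity among
    different tokens"; the paper's actual metric may differ.
    """
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens to compare")
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += 1.0 - cosine_similarity(tokens[i], tokens[j])
            pairs += 1
    return total / pairs


# Hypothetical teacher outputs: one feature vector per image patch (token).
collapsed_teacher = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]   # identical tokens
diverse_teacher = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]    # spread-out tokens

low = token_diversity(collapsed_teacher)
high = token_diversity(diverse_teacher)
```

Under this assumed definition, a teacher whose tokens all point in the same direction scores zero, while a teacher whose tokens differ from one another scores higher, which is the property the abstract argues matters for the MIM target.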

