Masked Autoencoders (MAEs) achieve impressive performance in image classification tasks, yet the internal representations they learn remain poorly understood. This work began as an attempt to explain the strong downstream classification performance of MAE. In the process, we discovered that the representations learned through pretraining and fine-tuning are quite robust, maintaining good classification performance in the presence of degradations such as blur and occlusion. Through layer-wise analysis of token embeddings, we show that a pretrained MAE progressively constructs its latent space in a class-aware manner across network depth: embeddings from different classes lie in subspaces that become increasingly separable. We further observe that MAE exhibits early and persistent global attention across encoder layers, in contrast to standard Vision Transformers (ViTs). To quantify feature robustness, we introduce two sensitivity indicators: directional alignment between clean and perturbed embeddings, and head-wise retention of active features under degradation. Together, these analyses help explain the robust classification performance of MAEs.
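To make the two sensitivity indicators concrete, here is a minimal sketch of how they could be computed over token embeddings. This is an illustrative implementation under stated assumptions, not the paper's exact definitions: the function names, the mean-cosine form of directional alignment, and the threshold-based notion of an "active" feature are all assumptions introduced here.

```python
import numpy as np


def directional_alignment(clean, perturbed, eps=1e-8):
    """Mean cosine similarity between clean and perturbed token embeddings.

    clean, perturbed: arrays of shape (num_tokens, dim).
    Values near 1 indicate that degradation barely rotates the embeddings;
    this is one plausible reading of "directional alignment" (assumption).
    """
    num = (clean * perturbed).sum(axis=-1)
    den = np.linalg.norm(clean, axis=-1) * np.linalg.norm(perturbed, axis=-1) + eps
    return float((num / den).mean())


def headwise_retention(clean, perturbed, thresh=0.0):
    """Per-head fraction of clean-active features that stay active under degradation.

    clean, perturbed: arrays of shape (num_heads, dim).
    A feature counts as "active" when it exceeds `thresh` (assumption).
    Returns an array of shape (num_heads,).
    """
    act_clean = clean > thresh
    act_pert = perturbed > thresh
    kept = (act_clean & act_pert).sum(axis=-1)
    total = np.maximum(act_clean.sum(axis=-1), 1)  # avoid division by zero
    return kept / total
```

As a sanity check, identical clean and perturbed inputs should give an alignment of 1 and a retention of 1 for every head, with both indicators dropping as the perturbation grows.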