The Masked Autoencoder (MAE) is a prominent method for self-supervised pretraining in visual representation learning. It randomly masks image patches and reconstructs them from the unmasked ones. A key limitation of MAE is that it selects the patches to mask uniformly at random, disregarding how informative each patch is. To address this, some approaches mask patches according to their informativeness. However, these methods often ignore the specific requirements of downstream tasks, which can lead to suboptimal representations for those tasks. In response, we introduce the Multi-level Optimized Mask Autoencoder (MLO-MAE), a novel framework that leverages end-to-end feedback from downstream tasks to learn an optimal masking strategy during pretraining. Our experiments show that MLO-MAE substantially improves on existing methods across diverse datasets and tasks, demonstrating its adaptability and efficiency. Our code is available at: https://github.com/Alexiland/MLOMAE
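To make the baseline concrete, the following is a minimal NumPy sketch of the uniform random masking step that standard MAE relies on. It is not the MLO-MAE implementation (see the repository above for that); the helper names `patchify` and `random_patch_mask`, the 32x32 image, the patch size of 8, and the mask ratio of 0.75 are all illustrative assumptions.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into rows of flattened, non-overlapping patches."""
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    # Crop to a multiple of patch_size, then group pixels by patch.
    grid = image[: gh * patch_size, : gw * patch_size]
    grid = grid.reshape(gh, patch_size, gw, patch_size, c)
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(gh * gw, patch_size * patch_size * c)


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a boolean mask; True marks patches hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    # Uniform sampling: every patch is equally likely to be masked,
    # regardless of how informative its content is.
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    mask = np.zeros(num_patches, dtype=bool)
    mask[masked_idx] = True
    return mask


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))           # stand-in for a real image (assumption)
patches = patchify(image, patch_size=8)   # 4 x 4 grid of patches
mask = random_patch_mask(len(patches), mask_ratio=0.75, rng=rng)

visible = patches[~mask]   # what the encoder would see
targets = patches[mask]    # what the decoder would be asked to reconstruct
```

Because `random_patch_mask` never looks at the patch contents, it illustrates the limitation described above: the choice of masked patches is independent of both patch informativeness and any downstream task.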