Driven by the success of Masked Language Modeling (MLM), self-supervised learning for computer vision has been invigorated by Masked Image Modeling (MIM), which plays a central role in recent breakthroughs. Notwithstanding the achievements of MIM across various downstream tasks, its overall efficiency is hampered by the lengthy pre-training phase. This paper presents the optimization of masked tokens as a means of addressing this issue. We first explore the inherent properties that a masked token ought to possess, and among these we principally articulate and emphasize the `data singularity' attribute inherent in masked tokens. Through a comprehensive analysis of the heterogeneity between masked tokens and visible tokens within pre-trained models, we propose masked token optimization (MTO), a novel approach designed to improve model efficiency through weight recalibration and the enhancement of this key property of masked tokens. MTO serves as an adaptable solution that seamlessly integrates into any MIM approach that leverages masked tokens. As a result, MTO achieves a considerable improvement in pre-training efficiency, reducing by approximately 50% the number of pre-training epochs required to reach the converged performance of recent approaches.
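To make the setting concrete, the sketch below illustrates the generic masked-token mechanism that MIM approaches share and that MTO builds on: a random subset of patch embeddings is replaced by a single shared, learnable mask token before the encoder reconstructs the hidden content. This is a minimal illustration of standard MIM corruption, not the MTO method itself; the function name `apply_mask_tokens` and the NumPy stand-ins for learnable parameters are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def apply_mask_tokens(patches, mask_token, mask_ratio=0.6, rng=rng):
    """Replace a random subset of patch embeddings with a shared mask token.

    patches:    (num_patches, dim) array of patch embeddings.
    mask_token: (dim,) vector shared by all masked positions (learnable
                in a real model; a fixed array here for illustration).
    Returns the corrupted sequence and the boolean mask of hidden positions.
    """
    num_patches = patches.shape[0]
    num_masked = int(num_patches * mask_ratio)
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    mask = np.zeros(num_patches, dtype=bool)
    mask[masked_idx] = True
    corrupted = patches.copy()
    corrupted[mask] = mask_token  # every masked position gets the same vector
    return corrupted, mask

# Hypothetical ViT-Base-like shapes: 14x14 = 196 patches, 768-dim embeddings.
patches = rng.standard_normal((196, 768))
mask_token = rng.standard_normal(768)
corrupted, mask = apply_mask_tokens(patches, mask_token)
print(int(mask.sum()))  # prints 117 (60% of 196 patches masked)
```

Because every masked position carries the identical vector while visible tokens vary with the input, masked and visible tokens are statistically heterogeneous inside the network; the abstract's `data singularity' property and the weight recalibration in MTO target exactly this asymmetry.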