Images contain heavy spatial redundancy because pixels in neighboring regions are correlated. Existing approaches reduce this redundancy by discarding less meaningful image regions. However, the leading methods rely on supervisory signals, which can push models to preserve content that aligns with labeled categories and to discard content belonging to unlabeled categories. This categorical inductive bias makes these methods less effective in real-world scenarios. To address this issue, we propose Learning to Rank Patches (LTRP), a self-supervised framework for image redundancy reduction. We observe that the reconstructions produced by masked image modeling models are sensitive to the removal of visible patches when the masking ratio is high (e.g., 90\%). Building on this observation, we implement LTRP in two steps: first, we infer a semantic density score for each patch by quantifying the variation between reconstructions with and without that patch; second, we learn to rank the patches using these pseudo scores. The entire process is self-supervised and therefore avoids the categorical inductive bias. We conduct extensive experiments on different datasets and tasks. The results show that LTRP outperforms both supervised and other self-supervised methods, which we attribute to its fair assessment of image content.
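The two steps described above can be illustrated with a minimal sketch. It is not the LTRP implementation: `toy_reconstruct` is a hypothetical stand-in for a masked image modeling model, the mean squared difference is an assumed choice for "variation between reconstructions", and the RankNet-style pairwise loss is one assumed way to learn a ranking from pseudo scores. All function names are illustrative.

```python
import numpy as np


def toy_reconstruct(patches, visible):
    """Hypothetical stand-in for a masked image modeling model.

    Visible patches are copied through; every masked patch is filled
    with the mean of the visible patches.
    """
    patches = np.asarray(patches, dtype=float)
    visible = np.asarray(visible, dtype=int)
    out = np.empty_like(patches)
    out[:] = patches[visible].mean(axis=0)
    out[visible] = patches[visible]
    return out


def pseudo_scores(patches, visible, reconstruct=toy_reconstruct):
    """Step 1: score each visible patch by how much the reconstruction
    changes when that patch is removed (assumed metric: mean squared
    difference). Requires at least two visible patches.
    """
    patches = np.asarray(patches, dtype=float)
    visible = np.asarray(visible, dtype=int)
    full = reconstruct(patches, visible)
    scores = {}
    for i in visible:
        reduced = visible[visible != i]
        without = reconstruct(patches, reduced)
        scores[int(i)] = float(np.mean((full - without) ** 2))
    return scores


def pairwise_ranking_loss(pred, target):
    """Step 2 (assumed loss): RankNet-style pairwise logistic loss that
    penalizes predicted scores whose order disagrees with the pseudo scores.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    diff_pred = pred[:, None] - pred[None, :]
    should_outrank = target[:, None] > target[None, :]
    if not should_outrank.any():
        return 0.0
    # log(1 + exp(-d)) for every pair (i, j) where i should outrank j
    return float(np.mean(np.logaddexp(0.0, -diff_pred[should_outrank])))
```

In this toy setting, a patch whose content differs strongly from the other visible patches receives a higher pseudo score, because removing it changes the reconstruction the most. A ranking model trained with the pairwise loss would then be encouraged to reproduce that ordering without any category labels.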