Autoregressive modeling (AM) for self-supervised pre-training is less developed in computer vision than in natural language processing (NLP). The main obstacle is that images are not sequential signals and have no natural order in which to apply autoregressive modeling. In this study, inspired by the way humans grasp an image, i.e., by focusing on the main object first, we present a semantic-aware autoregressive image modeling (SemAIM) method to tackle this challenge. The key insight of SemAIM is to autoregressively model images from the more semantic patches to the less semantic patches. To this end, we first calculate a semantic-aware permutation of patches according to their feature similarities and then perform the autoregression procedure based on this permutation. In addition, considering that the raw pixels of patches are low-level signals and are not ideal prediction targets for learning high-level semantic representations, we also explore using patch features as the prediction targets. Extensive experiments are conducted on a broad range of downstream tasks, including image classification, object detection, and instance/semantic segmentation, to evaluate the performance of SemAIM. The results demonstrate that SemAIM achieves state-of-the-art performance compared with other self-supervised methods. Specifically, with ViT-B, SemAIM achieves 84.1% top-1 accuracy for fine-tuning on ImageNet, and 51.3% AP and 45.4% AP for object detection and instance segmentation on COCO, outperforming the vanilla MAE by 0.5%, 1.0%, and 0.5%, respectively.
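To make the two steps named in the abstract concrete (compute a semantic-aware permutation, then run autoregression in that order), the following is a minimal illustrative sketch, not the SemAIM implementation. It rests on assumptions that the abstract does not state: patch features are compared by cosine similarity to a single reference vector (the hypothetical `query_feat`, e.g. a global image feature), patches are sorted from most to least similar, and the autoregressive order is enforced with a strict attention mask. The function names `semantic_permutation` and `permuted_causal_mask` are invented for illustration.

```python
import numpy as np


def semantic_permutation(patch_feats, query_feat):
    """Return patch indices ordered from most to least similar to query_feat.

    Assumption: "semantic" is approximated by cosine similarity to one
    reference vector; the actual scoring used by SemAIM may differ.
    """
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    sim = p @ q  # cosine similarity of each patch to the reference
    return np.argsort(-sim, kind="stable")  # descending similarity


def permuted_causal_mask(perm):
    """Boolean mask where mask[i, j] is True if patch i may attend to patch j.

    Patch i sees only patches that come strictly earlier in the permutation,
    so the first patch in the order sees nothing (a real model would need
    something like a start token; that is omitted in this sketch).
    """
    n = len(perm)
    rank = np.empty(n, dtype=int)
    rank[perm] = np.arange(n)  # rank[k] = position of patch k in the order
    return rank[None, :] < rank[:, None]


# Toy example: four patches with 2-D features and a hypothetical reference.
patch_feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query_feat = np.array([1.0, 0.0])

perm = semantic_permutation(patch_feats, query_feat)
mask = permuted_causal_mask(perm)
print(perm)
print(mask.astype(int))
```

In this toy case the patch whose feature matches the reference comes first and the opposite-pointing patch comes last; each row of the mask then marks which earlier patches are visible when predicting that patch. The prediction target itself (raw pixels or patch features, as the abstract discusses) is independent of this ordering step and is not modeled here.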