The human visual recognition system shows an astonishing capability to compress visual information into a set of tokens containing rich representations, without label supervision. One critical driving principle behind this capability is perceptual grouping. Although perceptual grouping was widely used in computer vision in the early 2010s, it remains unclear whether it can be leveraged to derive a neural visual recognition backbone that generates equally powerful representations. In this paper, we propose the Perceptual Group Tokenizer, a model that relies entirely on grouping operations to extract visual features and perform self-supervised representation learning. A series of grouping operations iteratively hypothesizes the context for pixels or superpixels in order to refine their feature representations. We show that the proposed model achieves performance competitive with state-of-the-art vision architectures and inherits desirable properties, including adaptive computation without re-training and interpretability. Specifically, the Perceptual Group Tokenizer achieves 80.3% accuracy on the ImageNet-1K self-supervised learning benchmark under linear probe evaluation, marking new progress under this paradigm.
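To make the idea of an iterative grouping operation concrete, the following is a minimal sketch, not the model described in the abstract. It assumes a soft k-means-style update: each feature vector distributes its mass over a small set of group tokens via a softmax over dot-product similarities, and each group token is then recomputed as the weighted mean of the features assigned to it. The function names (`softmax`, `group_tokens`), the initialization from randomly sampled features, and the scaling by the square root of the feature dimension are all illustrative assumptions; the abstract does not specify the actual grouping operation.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def group_tokens(features, num_groups, num_iters=3, seed=0):
    """Softly assign N feature vectors to num_groups group tokens.

    Hypothetical helper: a soft k-means-style stand-in for a grouping
    operation, not the paper's actual layer. Requires num_groups <= N.
    """
    rng = random.Random(seed)
    dim = len(features[0])
    # Assumption: initialize group tokens from randomly chosen input features.
    groups = [list(f) for f in rng.sample(features, num_groups)]
    assign = []
    for _ in range(num_iters):
        # Each feature distributes one unit of mass over the current groups.
        assign = []
        for f in features:
            logits = [
                sum(a * b for a, b in zip(f, g)) / math.sqrt(dim)
                for g in groups
            ]
            assign.append(softmax(logits))
        # Each group token becomes the weighted mean of the features.
        for k in range(num_groups):
            weight = sum(a[k] for a in assign)
            groups[k] = [
                sum(a[k] * f[d] for a, f in zip(assign, features)) / weight
                for d in range(dim)
            ]
    # assign holds the soft assignments computed in the final iteration,
    # i.e. against the group tokens as they were before the last update.
    return groups, assign


# Toy usage: four 2-D "superpixel" features summarized by two group tokens.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
groups, assign = group_tokens(features, num_groups=2)
```

In this sketch `num_groups` is an argument of the call rather than a learned quantity, so it can be changed from one call to the next. That loosely illustrates how a grouping-based model could vary its token budget without re-training, though the mechanism in the actual model may differ.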