The human visual recognition system shows an astonishing capability to compress visual information into a set of tokens containing rich representations without label supervision. One critical driving principle behind this is perceptual grouping. Despite being widely used in computer vision in the early 2010s, it remains a mystery whether perceptual grouping can be leveraged to derive a neural visual recognition backbone that generates equally powerful representations. In this paper, we propose the Perceptual Group Tokenizer, a model that relies entirely on grouping operations to extract visual features and perform self-supervised representation learning, where a series of grouping operations iteratively hypothesize the context for pixels or superpixels to refine feature representations. We show that the proposed model achieves competitive performance compared to state-of-the-art vision architectures, and inherits desirable properties including adaptive computation without re-training, and interpretability. Specifically, the Perceptual Group Tokenizer achieves 80.3% on the ImageNet-1K self-supervised learning benchmark with linear probe evaluation, marking new progress under this paradigm.
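To make the "hypothesize-and-refine" idea concrete, the sketch below shows one common way such an iterative grouping step can be realized, in the style of slot-attention-like competition: a set of group tokens softly compete for input pixels/superpixels, then each group is updated as the weighted mean of the features it claims. This is a minimal illustrative sketch, not the paper's exact architecture; the function name `grouping_iteration` and all shapes are assumptions for illustration.

```python
import numpy as np

def grouping_iteration(tokens, groups, n_iters=3, eps=1e-8):
    """Illustrative hypothesize-and-refine grouping loop (assumed form).

    tokens: (N, D) array of pixel/superpixel features.
    groups: (K, D) array of group-token features.
    Returns refined group tokens and the final soft assignment.
    """
    for _ in range(n_iters):
        # Similarity logits between every group token and every input token.
        logits = groups @ tokens.T / np.sqrt(tokens.shape[1])    # (K, N)
        # Softmax over groups: each input token is softly assigned to groups,
        # so groups compete for inputs rather than inputs for groups.
        attn = np.exp(logits - logits.max(axis=0, keepdims=True))
        attn /= attn.sum(axis=0, keepdims=True)
        # Normalize per group, then update each group token as the
        # attention-weighted mean of the features it claims.
        weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
        groups = weights @ tokens                                # (K, D)
    return groups, attn
```

Because the number of group tokens K is just the row count of `groups`, it can be changed at inference time without retraining, which is the kind of adaptive computation the abstract refers to.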