Vision Transformers have achieved great success in computer vision, delivering strong performance across a wide range of tasks. However, because they operate on sequential input, images must be manually partitioned into patch sequences, which disrupts the image's inherent structural and semantic continuity. To address this, we propose a novel Pattern Transformer (Patternformer) that adaptively converts images into pattern sequences for Transformer input. Specifically, we employ a convolutional neural network to extract various patterns from the input image, with each channel representing a unique pattern that is fed into the succeeding Transformer as a visual token. Because the network is free to optimize these patterns, each pattern concentrates on its own local region of interest, thereby preserving its intrinsic structural and semantic information. Using only a vanilla ResNet and a vanilla Transformer, we achieve state-of-the-art performance on CIFAR-10 and CIFAR-100 and competitive results on ImageNet.
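To make the contrast between patch tokens and pattern tokens concrete, the following is a minimal shape-level sketch in plain Python. It is not the authors' implementation: the helper names `patch_tokens` and `pattern_tokens`, the toy image, and the hand-written feature map standing in for CNN output are all assumptions made for illustration. The abstract does not say how a channel is projected before it enters the Transformer, so the sketch simply flattens each channel.

```python
"""Toy contrast between fixed patch tokens and channel-wise "pattern" tokens."""


def patch_tokens(image, patch):
    """Split an H x W image (list of rows) into flattened patch x patch tiles.

    Assumes H and W are divisible by `patch`. Each tile becomes one token.
    """
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


def pattern_tokens(feature_map):
    """Flatten each channel of a C x H x W feature map into one token.

    The result has C tokens, each of length H * W, so the sequence length
    is set by the number of channels rather than by a patch grid.
    """
    return [[value for row in channel for value in row]
            for channel in feature_map]


# Hypothetical 4 x 4 single-channel image with pixel values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]

# Made-up stand-in for a CNN output: 3 channels, each 2 x 2.
# In a trained network every channel would be a learned response map.
feature_map = [
    [[1, 0], [0, 0]],          # responds to the top-left region
    [[0, 0], [0, 1]],          # responds to the bottom-right region
    [[0.5, 0.5], [0.5, 0.5]],  # responds uniformly
]

patches = patch_tokens(image, 2)        # 4 tokens, fixed by the grid
patterns = pattern_tokens(feature_map)  # 3 tokens, one per channel
```

In this sketch the patch tokens are cut along a fixed grid regardless of content, whereas each pattern token covers the whole spatial extent of its channel, so where a token attends is decided by the values in that channel, not by a manual partition.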