TransMix, a recently proposed data augmentation method, uses attention labels to help vision transformers (ViTs) achieve better robustness and performance. However, TransMix has two shortcomings: 1) its image cropping method may not be suitable for ViTs; 2) in the early stage of training, the model produces unreliable attention maps, and TransMix uses these maps to compute mixed attention labels, which can adversely affect the model. To address these issues, we propose MaskMix and Progressive Attention Labeling (PAL), which operate in image space and label space, respectively. In image space, MaskMix mixes two images based on a patch-like grid mask. The size of each mask patch is adjustable and is a multiple of the image patch size, which ensures that each image patch comes from only one image and contains more global content. In label space, PAL uses a progressive factor to dynamically re-weight the attention weights of the mixed attention label. Finally, we combine MaskMix and PAL into a new data augmentation method, named MixPro. Experimental results show that our method improves ViT-based models of various scales on ImageNet classification (73.8\% top-1 accuracy with DeiT-T trained for 300 epochs). After pre-training with MixPro on ImageNet, ViT-based models also transfer better to semantic segmentation, object detection, and instance segmentation. Furthermore, MixPro shows stronger robustness than TransMix on several benchmarks. The code is available at https://github.com/fistyee/MixPro.
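To make the two components more concrete, the following is a minimal sketch in plain Python, not the released MixPro implementation. It illustrates a grid mask whose cells are a multiple of the ViT patch size, and a label weight that blends an area-based ratio with an attention-based ratio. All function names, the `scale` and `trust` parameters, and the linear blending rule are assumptions made for illustration; the abstract does not specify how the progressive factor is computed, so here it is simply passed in as a number in [0, 1].

```python
import random


def maskmix_grid(image_size, patch_size, scale, mix_ratio, rng):
    """Build a per-pixel binary mask: 1 = pixel from image A, 0 = image B.

    Each mask cell has side patch_size * scale (hypothetical parameter),
    so every ViT patch lies entirely inside one cell.
    """
    if not 0.0 <= mix_ratio <= 1.0:
        raise ValueError("mix_ratio must be in [0, 1]")
    cell = patch_size * scale
    if image_size % cell != 0:
        raise ValueError("image_size must be divisible by patch_size * scale")
    n = image_size // cell                      # cells per side
    num_a = round(mix_ratio * n * n)            # cells assigned to image A
    cells = [1] * num_a + [0] * (n * n - num_a)
    rng.shuffle(cells)                          # random cell placement
    return [[cells[(r // cell) * n + (c // cell)] for c in range(image_size)]
            for r in range(image_size)]


def mix_images(img_a, img_b, mask):
    """Pick each pixel from img_a where mask is 1, else from img_b."""
    return [[a if m else b for a, b, m in zip(row_a, row_b, row_m)]
            for row_a, row_b, row_m in zip(img_a, img_b, mask)]


def area_lambda(mask):
    """Fraction of pixels taken from image A."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def attention_lambda(attn, mask, patch_size):
    """Share of per-patch attention that falls on patches from image A.

    attn is a square grid of non-negative attention weights, one per patch.
    """
    n = len(attn)
    on_a = sum(attn[i][j] for i in range(n) for j in range(n)
               if mask[i * patch_size][j * patch_size])
    return on_a / sum(sum(row) for row in attn)


def progressive_lambda(lam_area, lam_attn, trust):
    """Assumed blending rule: trust = 0 uses the area ratio only,
    trust = 1 uses the attention ratio only."""
    return (1.0 - trust) * lam_area + trust * lam_attn
```

In this sketch, a training loop would keep `trust` near 0 early on, when attention maps are unreliable, and raise it as the model improves. The mixed label for classes A and B would then be weighted by `progressive_lambda(...)` and one minus that value.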