The recently proposed data augmentation method TransMix uses attention labels to help vision transformers (ViTs) achieve better robustness and performance. However, TransMix is deficient in two respects: 1) its image cropping method may not be suitable for vision transformers; 2) in the early stage of training, the model produces unreliable attention maps, and TransMix uses these unreliable maps to compute mixed attention labels, which can harm the model. To address these issues, we propose MaskMix and Progressive Attention Labeling (PAL), which operate in image space and label space, respectively. In image space, MaskMix mixes two images based on a patch-like grid mask. The size of each mask patch is adjustable and is a multiple of the image patch size, which ensures that each image patch comes from only one image and contains more global content. In label space, PAL uses a progressive factor to dynamically re-weight the attention weights of the mixed attention label. Finally, we combine MaskMix and PAL into a new data augmentation method, named MixPro. Experimental results show that our method improves ViT-based models of various scales on ImageNet classification (73.8\% top-1 accuracy with DeiT-T trained for 300 epochs). After pre-training with MixPro on ImageNet, ViT-based models also demonstrate better transferability to semantic segmentation, object detection, and instance segmentation. Furthermore, compared to TransMix, MixPro also shows stronger robustness on several benchmarks. The code will be released at https://github.com/fistyee/MixPro.
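To make the two components concrete, the following is a minimal NumPy sketch, not the released implementation. It rests on several assumptions that the abstract does not state: each grid cell is assigned to one of the two images by an independent random draw, the label weight is a convex combination of an area-based weight and an attention-based weight, and the progressive factor `alpha` is supplied by the caller. The function names `maskmix` and `progressive_label_weight`, and the parameters `scale` and `mix_ratio`, are hypothetical.

```python
import numpy as np


def maskmix(img_a, img_b, image_patch=16, scale=2, mix_ratio=0.5, rng=None):
    """Mix two HxWxC images with a grid mask (illustrative sketch).

    Each mask cell has side `scale * image_patch`, so every ViT image patch
    lies entirely inside one cell and therefore comes from a single image.
    The per-cell random assignment is an assumption of this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = img_a.shape[:2]
    cell = image_patch * scale
    if h % cell or w % cell:
        raise ValueError("image size must be divisible by the mask cell size")
    # One draw per grid cell: True means the cell is taken from img_a.
    grid = rng.random((h // cell, w // cell)) < mix_ratio
    # Expand the cell-level grid to pixel resolution.
    pixel_mask = np.repeat(np.repeat(grid, cell, axis=0), cell, axis=1)
    mixed = np.where(pixel_mask[..., None], img_a, img_b)
    # Expand the same grid to image-patch resolution (one entry per ViT patch).
    patch_mask = np.repeat(np.repeat(grid, scale, axis=0), scale, axis=1)
    return mixed, patch_mask


def progressive_label_weight(patch_mask, attn, alpha):
    """Label weight for img_a (illustrative sketch; the blend is an assumption).

    patch_mask: boolean array, True where a patch comes from img_a.
    attn: non-negative attention of the class token over patches, flattened.
    alpha: progressive factor in [0, 1]; 1 trusts only the area ratio,
           0 trusts only the attention map. How alpha is computed is not
           specified in the abstract, so it is left as an input here.
    """
    m = patch_mask.reshape(-1).astype(float)
    attn = np.asarray(attn, dtype=float)
    attn = attn / attn.sum()
    lam_area = m.mean()                 # fraction of patches from img_a
    lam_attn = float((attn * m).sum())  # attention mass on img_a patches
    return alpha * lam_area + (1.0 - alpha) * lam_attn
```

Under these assumptions, the mixed label would be `lam * y_a + (1 - lam) * y_b`, with `lam` returned by `progressive_label_weight`. Keeping `alpha` close to 1 while the attention maps are still unreliable reduces the label to the plain area ratio.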