Recent advances in Deep Neural Network (DNN) models have significantly improved performance across computer vision tasks. However, training highly generalizable, high-performing vision models requires extensive datasets, which in turn demand large amounts of storage. This storage requirement is a critical bottleneck for scaling up vision models. Motivated by the success of discrete representations, SeiT proposes using Vector-Quantized (VQ) feature vectors (i.e., tokens) as network inputs for vision classification. However, applying traditional data augmentations directly to tokens is difficult because of the input domain shift. To address this issue, we introduce TokenAdapt and ColorAdapt, simple yet effective token-based augmentation strategies. TokenAdapt realigns the token embedding space for compatibility with spatial augmentations, preserving the model's efficiency without requiring fine-tuning. ColorAdapt, inspired by Adaptive Instance Normalization (AdaIN), addresses color-based augmentations for tokens. We evaluate our approach across various scenarios, including storage-efficient ImageNet-1k classification, fine-grained classification, robustness benchmarks, and ADE-20k semantic segmentation. Experimental results show consistent performance improvements across these settings. Code is available at https://github.com/naver-ai/tokenadapt.
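For readers unfamiliar with AdaIN, the operation that the abstract cites as the inspiration for ColorAdapt can be sketched as follows. This is a minimal illustration of generic AdaIN only, not the ColorAdapt implementation from the linked repository: it re-normalizes each channel of a content feature map so that its mean and standard deviation match those of a style feature map. The function name `adain`, the list-of-channels layout, and the toy values are assumptions made for this example.

```python
from statistics import fmean, pstdev


def adain(content, style, eps=1e-5):
    """Generic AdaIN over per-channel statistics (illustrative sketch).

    content, style: lists of channels; each channel is a flat list of
    floats, one value per spatial position (hypothetical layout).
    """
    out = []
    for c_chan, s_chan in zip(content, style):
        c_mean, c_std = fmean(c_chan), pstdev(c_chan)
        s_mean, s_std = fmean(s_chan), pstdev(s_chan)
        # Normalize the content channel, then rescale and shift it
        # with the style channel's statistics.
        out.append([s_std * (x - c_mean) / (c_std + eps) + s_mean
                    for x in c_chan])
    return out


# Toy example: two channels, four spatial positions each.
content = [[0.0, 1.0, 2.0, 3.0], [10.0, 10.5, 11.0, 11.5]]
style = [[5.0, 7.0, 9.0, 11.0], [-1.0, 0.0, 1.0, 2.0]]
stylized = adain(content, style)
```

After the call, each channel of `stylized` keeps the relative ordering of the content values but carries approximately the mean and standard deviation of the corresponding style channel. How ColorAdapt chooses these statistics for token embeddings is not specified in the abstract.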