Learning new classes without forgetting previously learned ones is crucial for classification models in real-world applications. Vision Transformers (ViTs) have recently achieved remarkable performance in Class Incremental Learning (CIL). Previous works mainly focus on block design and model expansion for ViTs. In this paper, however, we find that when a ViT is trained incrementally, its attention layers gradually lose concentration on local features. We call this phenomenon \emph{Locality Degradation} in ViTs for CIL. Since low-level local information is crucial to the transferability of the representation, it is beneficial to preserve locality in the attention layers. We therefore encourage the model to preserve more local information as training proceeds and devise a Locality-Preserved Attention (LPA) layer to emphasize the importance of local features. Specifically, we incorporate local information directly into the vanilla attention and control the initial gradients of the vanilla attention by weighting it with a small initial value. Extensive experiments show that the representations facilitated by LPA capture more low-level, general information that is easier to transfer to follow-up tasks. The improved model achieves consistently better performance on CIFAR100 and ImageNet100.
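The idea of combining a local term with a down-weighted vanilla attention can be illustrated with a minimal sketch. This is not the paper's exact LPA formulation, which the abstract does not spell out. It assumes a hypothetical Gaussian locality prior over 1-D token indices, a scalar weight `alpha` (imagined as a learnable parameter initialized to a small value) on the vanilla attention, and a simple mean attention distance as a rough proxy for how local the attention is. All function names and the parameters `alpha` and `sigma` are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def vanilla_attention(queries, keys):
    """Row-stochastic attention weights from scaled dot products."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        for q in queries
    ]


def local_prior(n_tokens, sigma=1.0):
    """Assumed locality prior: Gaussian in token-index distance, row-normalized."""
    return [
        softmax([-((i - j) ** 2) / (2.0 * sigma ** 2) for j in range(n_tokens)])
        for i in range(n_tokens)
    ]


def locality_preserved_weights(queries, keys, alpha=0.1, sigma=1.0):
    """Blend the local prior with alpha-weighted vanilla attention.

    Both inputs are row-stochastic, so dividing by (1 + alpha) keeps every
    row summing to 1. A small alpha means the vanilla branch contributes
    little at first; in a trainable model its gradients would be scaled
    by alpha as well.
    """
    vanilla = vanilla_attention(queries, keys)
    local = local_prior(len(queries), sigma)
    return [
        [(l + alpha * v) / (1.0 + alpha) for l, v in zip(l_row, v_row)]
        for l_row, v_row in zip(local, vanilla)
    ]


def mean_attention_distance(weights):
    """Average index distance |i - j| weighted by attention, over all queries."""
    n = len(weights)
    return sum(
        w * abs(i - j) for i, row in enumerate(weights) for j, w in enumerate(row)
    ) / n


if __name__ == "__main__":
    # Identical tokens make vanilla attention uniform, i.e. fully non-local.
    tokens = [[1.0, 0.0] for _ in range(5)]
    print("vanilla:", mean_attention_distance(vanilla_attention(tokens, tokens)))
    print("blended:", mean_attention_distance(locality_preserved_weights(tokens, tokens)))
```

In this toy setting, a lower mean attention distance indicates more local attention; tracking such a quantity across incremental tasks is one plausible way to observe the locality degradation described above.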