On-device continual learning (CL) is practical only when model accuracy and resource efficiency are co-optimized. This is extremely challenging: a CL method must preserve accuracy while learning new tasks from continuously drifting data, and it must remain both energy- and memory-efficient to be deployable on real-world devices. Typically, a CL method builds on one of two types of backbone network: a CNN or a ViT. It is commonly believed that CNN-based CL excels in resource efficiency whereas ViT-based CL is superior in model performance, so each option is attractive in only one respect. In this paper, we revisit this comparison using powerful pre-trained ViT models of various sizes, including ViT-Ti (5.8M parameters). Our detailed analysis reveals that many practical options exist today for making ViT-based methods more suitable for on-device CL, even when accuracy, energy, and memory are all considered. To further expand this impact, we introduce REP, which improves the resource efficiency of prompt-based rehearsal-free methods. Our key focus is on trimming computational and memory costs throughout the training process while avoiding catastrophic trade-offs with accuracy. We achieve this by exploiting swift prompt selection that enhances input data using a carefully provisioned model, and by developing two novel algorithms, adaptive token merging (AToM) and adaptive layer dropping (ALD), that optimize the prompt updating stage. In particular, AToM and ALD perform selective skipping across the data and model-layer dimensions without compromising task-specific features in vision transformer models. Extensive experiments on three image classification datasets validate REP's superior resource efficiency over current state-of-the-art methods.
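To make "selective skipping across the data and model-layer dimensions" concrete, the following is a minimal, generic sketch of the two underlying ideas: reducing the number of tokens by merging similar ones, and skipping whole layers during a forward pass. It is not REP's AToM or ALD; the adaptive policies that define those algorithms are not described in this abstract and are not modeled here. The function names, the greedy adjacent-pair merging rule, and the fixed per-layer drop probabilities are all assumptions chosen for illustration.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_tokens(tokens, r):
    """Hypothetical helper: greedily average the most similar adjacent pair, r times.

    Each merge shortens the sequence by one token, so later layers
    process fewer tokens (skipping along the data dimension).
    """
    tokens = [list(t) for t in tokens]
    for _ in range(min(r, len(tokens) - 1)):
        # Index of the adjacent pair with the highest cosine similarity.
        i = max(range(len(tokens) - 1),
                key=lambda k: cosine(tokens[k], tokens[k + 1]))
        merged = [(x + y) / 2 for x, y in zip(tokens[i], tokens[i + 1])]
        tokens[i:i + 2] = [merged]
    return tokens


def forward_with_layer_drop(tokens, layers, drop_probs, rng):
    """Hypothetical helper: apply layers in order, skipping each with its own probability.

    A skipped layer acts as the identity (skipping along the model-layer
    dimension). The probabilities here are fixed inputs, not adaptive.
    """
    for layer, p in zip(layers, drop_probs):
        if rng.random() < p:
            continue  # layer dropped: pass tokens through unchanged
        tokens = layer(tokens)
    return tokens


def add_one(tokens):
    """Stand-in for a transformer block: adds 1 to every value."""
    return [[v + 1 for v in t] for t in tokens]
```

As a usage example, `merge_tokens([[1, 0], [1, 0], [0, 1]], r=1)` merges the two identical leading tokens and leaves two tokens, while `forward_with_layer_drop(tokens, [add_one, add_one], [0.0, 1.0], random.Random(0))` always applies the first layer and always skips the second. In a real vision transformer the merging would operate on patch embeddings and the skipped units would be transformer blocks.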