Despite their success, large pretrained vision models remain vulnerable to catastrophic forgetting when adapted to new tasks in class-incremental settings. Parameter-efficient fine-tuning (PEFT) alleviates this by restricting trainable parameters, yet most approaches still rely on cross-entropy (CE) loss, a surrogate for the 0-1 loss, to learn from new data. We revisit this choice and revive the true objective (0-1 loss) through a reinforcement learning perspective. By formulating classification as a one-step Markov Decision Process, we derive an Expected Policy Gradient (EPG) method that directly minimizes misclassification error with a low-variance gradient estimation. Our analysis shows that CE can be interpreted as EPG with an additional sample-weighting mechanism: CE encourages exploration by emphasizing low-confidence samples, while EPG prioritizes high-confidence ones. Building on this insight, we propose adaptive entropy annealing (aEPG), a training strategy that transitions from exploratory (CE-like) to exploitative (EPG-like) learning. aEPG-based methods outperform CE-based methods across diverse benchmarks and with various PEFT modules. More broadly, we evaluate various entropy regularization methods and demonstrate that lower entropy of the output prediction distribution enhances adaptation in pretrained vision models.
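The CE/EPG relationship described above can be made concrete with a small sketch. Minimizing the expected 0-1 loss is equivalent to minimizing `1 - p_y` (one minus the probability of the true class), whereas CE minimizes `-log p_y`; since `d(-log p_y) = -(1/p_y) dp_y`, CE is the EPG gradient reweighted by `1/p_y`, which up-weights low-confidence samples. One natural way to interpolate between the two regimes is the family `L_alpha = (1 - p_y^alpha) / alpha`, which recovers CE as `alpha -> 0` and the EPG objective at `alpha = 1`. This is an illustrative assumption, not the paper's exact aEPG schedule; the function and parameter names here are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def annealed_epg_loss(logits, target, alpha):
    """Hypothetical interpolation L_alpha = (1 - p_y**alpha) / alpha.

    alpha -> 0: recovers cross-entropy (-log p_y), the exploratory regime.
    alpha  = 1: gives 1 - p_y, the expected 0-1 loss (EPG), the exploitative
    regime. An annealing schedule would increase alpha over training.
    """
    p_y = softmax(logits)[target]
    if alpha == 0.0:
        return -math.log(p_y)          # cross-entropy limit
    return (1.0 - p_y ** alpha) / alpha
```

A training loop following the abstract's strategy would start with a small `alpha` (CE-like, emphasizing hard samples) and anneal it toward 1 (EPG-like, prioritizing confident samples), lowering the entropy of the output distribution as adaptation proceeds.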