Not All Forgetting Is Equal: Architecture-Dependent Retention Dynamics in Fine-Tuned Image Classifiers

Fine-tuning pretrained image classifiers is standard practice, yet it remains unclear which individual samples are forgotten during fine-tuning and whether forgetting patterns are stable or architecture-dependent. Understanding these dynamics has direct implications for curriculum design, data pruning, and ensemble construction. We track per-sample correctness at every epoch during fine-tuning of ResNet-18 and DeiT-Small on OCTDL, a retinal OCT dataset (7 classes, 56:1 imbalance), and on CUB-200-2011 (200 bird species), fitting Ebbinghaus-style exponential decay curves to each sample's retention trace. Five findings emerge. First, the two architectures forget fundamentally different samples: the Jaccard overlap between their top 10% most-forgotten samples is 0.34 on OCTDL and 0.15 on CUB-200. Second, ViT forgetting is more structured (mean $R^2 = 0.74$) than CNN forgetting ($R^2 = 0.52$). Third, per-sample forgetting is stochastic across random seeds (Spearman $\rho \approx 0.01$), challenging the assumption that sample difficulty is an intrinsic property. Fourth, class-level forgetting is consistent and semantically interpretable: visually similar species are forgotten most, distinctive ones least. Fifth, a sample's loss after head warmup predicts its long-term decay constant ($\rho = 0.30$ to $0.50$, $p < 10^{-45}$). These findings suggest that architectural diversity in ensembles provides complementary retention coverage, and that curriculum or pruning methods based on per-sample difficulty may not generalize across runs. A spaced-repetition sampler built on these decay constants does not outperform random sampling, indicating that static scheduling cannot exploit unstable per-sample signals.
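The two per-sample measurements named in the abstract, an exponential decay fit to a retention trace and the Jaccard overlap of the most-forgotten samples, can be illustrated with a minimal sketch. It is not the authors' implementation. The retention model $R(t) = e^{-\lambda t}$ fitted to a 0/1 correctness trace, the grid search over $\lambda$, the function names, and the use of an arbitrary per-sample "forgetting score" for ranking are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np


def fit_decay_rate(trace, rates=None):
    """Least-squares fit of R(t) = exp(-rate * t) to a per-epoch 0/1 trace.

    Assumed model and fitting procedure (grid search), chosen to keep the
    sketch dependency-free. Returns (best_rate, r_squared).
    """
    if rates is None:
        rates = np.linspace(0.0, 2.0, 201)  # hypothetical candidate grid
    trace = np.asarray(trace, dtype=float)
    epochs = np.arange(len(trace))
    preds = np.exp(-np.outer(rates, epochs))      # (n_rates, n_epochs)
    sse = ((preds - trace) ** 2).sum(axis=1)      # residual error per rate
    best = int(np.argmin(sse))
    sst = ((trace - trace.mean()) ** 2).sum()     # total variance of trace
    r2 = 1.0 - sse[best] / sst if sst > 0 else float("nan")
    return float(rates[best]), float(r2)


def top_fraction(scores, fraction=0.10):
    """Indices of the highest-scoring `fraction` of samples, as a set."""
    scores = np.asarray(scores, dtype=float)
    k = max(1, int(round(fraction * len(scores))))
    order = np.argsort(-scores, kind="stable")    # descending by score
    return set(order[:k].tolist())


def jaccard(a, b):
    """Jaccard overlap |a & b| / |a | b| of two index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    # Toy forgetting scores for 20 samples under two hypothetical models.
    cnn_scores = np.arange(20, 0, -1)
    vit_scores = cnn_scores.copy()
    vit_scores[[0, 2]] = vit_scores[[2, 0]]       # swap two samples' scores
    overlap = jaccard(top_fraction(cnn_scores), top_fraction(vit_scores))
    print("Jaccard overlap of top 10%:", overlap)
    print("Decay fit for a quickly forgotten sample:",
          fit_decay_rate([1, 0, 0, 0, 0]))
```

A trace that is correct at every epoch has zero variance, so the sketch reports its $R^2$ as NaN rather than a number; how such samples are handled in the reported means is not stated in the abstract.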
