The growing demand for efficient deep learning has positioned dataset distillation as a pivotal technique for compressing training datasets while preserving model performance. However, existing inner-loop optimization methods for dataset distillation typically rely on random truncation strategies, which lack flexibility and often yield suboptimal results. In this work, we observe that neural networks exhibit distinct learning dynamics across different training stages (early, middle, and late), making random truncation ineffective. To address this limitation, we propose Automatic Truncated Backpropagation Through Time (AT-BPTT), a novel framework that dynamically adapts both truncation positions and window sizes according to intrinsic gradient behavior. AT-BPTT introduces three key components: (1) a probabilistic mechanism for stage-aware timestep selection, (2) an adaptive window sizing strategy based on gradient variation, and (3) a low-rank Hessian approximation to reduce computational overhead. Extensive experiments on CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet-1K show that AT-BPTT achieves state-of-the-art performance, improving accuracy by an average of 6.16% over baseline methods. Moreover, our approach accelerates inner-loop optimization by 3.9x while saving 63% memory cost.
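As a rough illustration of components (1) and (2), the following is a minimal Python sketch of how a truncation position and window size might be selected from observed per-step gradient norms. All names, stage weights, and window bounds here are hypothetical placeholders, not the paper's actual parameterization, which adapts these quantities from intrinsic gradient behavior.

```python
import random

def select_truncation(T, grad_norms, stage_weights=(0.2, 0.3, 0.5),
                      k_min=2, k_max=10, rng=None):
    """Pick a truncation position and window size for one inner-loop unroll.

    T             -- number of inner-loop timesteps
    grad_norms    -- per-step gradient norms from the previous unroll
    stage_weights -- hypothetical sampling weights for the early/middle/late
                     thirds of training (the paper's mechanism is probabilistic
                     and stage-aware, but not necessarily this simple)
    """
    rng = rng or random.Random()
    # (1) Stage-aware timestep selection: sample a stage by weight,
    # then a truncation position uniformly within that stage.
    bounds = [(0, T // 3), (T // 3, 2 * T // 3), (2 * T // 3, T)]
    lo, hi = rng.choices(bounds, weights=stage_weights)[0]
    pos = rng.randrange(lo, hi)
    # (2) Adaptive window sizing: use a larger backpropagation window
    # where the local gradient norm varies more.
    local = grad_norms[max(0, pos - 1):pos + 2]
    variation = max(local) - min(local)
    scale = variation / (max(grad_norms) - min(grad_norms) + 1e-12)
    k = k_min + round(scale * (k_max - k_min))
    return pos, min(max(k, k_min), k_max)
```

Backpropagation would then be truncated to the `k` steps ending at `pos`, rather than at a randomly chosen cut as in prior truncation strategies.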