Dataset distillation extracts a small set of synthetic training samples from a large dataset, so that a model trained on these samples achieves competitive performance on test data. In this work, we tackle dataset distillation at its core by treating it directly as a bilevel optimization problem. Re-examining the foundational backpropagation through time (BPTT) method, we study its pronounced gradient variance, computational burden, and long-term dependencies. To address these issues, we introduce an improved method, Random Truncated Backpropagation Through Time (RaT-BPTT). RaT-BPTT couples truncation with a random window, which stabilizes the gradients and speeds up the optimization while still covering long dependencies. This allows us to establish a new state of the art on a variety of standard dataset benchmarks. A closer look at the nature of distilled data reveals pronounced intercorrelation. In particular, subsets of a distilled dataset tend to perform much worse than a smaller dataset of the same size that was distilled directly. Leveraging RaT-BPTT, we devise a boosting mechanism that generates distilled datasets containing subsets with near-optimal performance across different data budgets.
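The abstract does not spell out the algorithm, so the following is only a minimal sketch of the idea as described: unroll the inner training loop for a randomly chosen number of steps, then backpropagate the outer (real-data) loss through only the last few steps. It uses a toy one-dimensional regression with hand-derived gradients so that it runs with the standard library alone. The function names (`inner_step`, `truncated_meta_grad`, `distill`), all hyperparameters, and the choice to sample the unroll length uniformly are illustrative assumptions, not the authors' implementation.

```python
import random


def inner_step(w, xs, ys, lr):
    """One gradient step on the inner loss 0.5 * (w * xs - ys) ** 2."""
    return w - lr * (w * xs - ys) * xs


def outer_loss(w, real):
    """Mean squared-error loss of the scalar model y = w * x on real data."""
    return sum(0.5 * (w * x - y) ** 2 for x, y in real) / len(real)


def truncated_meta_grad(w0, xs, ys, real, n_steps, window, lr):
    """Unroll n_steps inner steps; backpropagate through the last `window` only.

    Returns (d loss / d xs, d loss / d ys, loss), where the iterate that
    enters the window is treated as a constant (the truncation).
    """
    ws = [w0]
    for _ in range(n_steps):
        ws.append(inner_step(ws[-1], xs, ys, lr))

    # Adjoint of the final iterate: d outer_loss / d w_N.
    adj = sum((ws[-1] * x - y) * x for x, y in real) / len(real)
    g_xs = g_ys = 0.0
    first = max(n_steps - window, 0)
    for t in range(n_steps - 1, first - 1, -1):
        # w_{t+1} = w_t - lr * (w_t * xs - ys) * xs
        g_xs += adj * (-lr * (2.0 * ws[t] * xs - ys))
        g_ys += adj * (lr * xs)
        adj *= 1.0 - lr * xs * xs  # d w_{t+1} / d w_t
    return g_xs, g_ys, outer_loss(ws[-1], real)


def distill(real, total_steps=20, window=5, inner_lr=0.1,
            outer_lr=0.02, iters=3000, seed=0):
    """Learn one synthetic point (xs, ys) with random truncated unrolling."""
    rng = random.Random(seed)
    xs, ys = 1.0, 0.0  # hypothetical starting synthetic sample
    for _ in range(iters):
        w0 = rng.gauss(0.0, 1.0)              # fresh random model init
        n = rng.randint(window, total_steps)  # random end of the window
        g_xs, g_ys, _ = truncated_meta_grad(
            w0, xs, ys, real, n, window, inner_lr)
        xs -= outer_lr * g_xs
        ys -= outer_lr * g_ys
    return xs, ys


if __name__ == "__main__":
    real_data = [(x, 2.0 * x) for x in (-1.0, -0.5, 0.5, 1.0)]
    xs, ys = distill(real_data)
    w = 0.0
    for _ in range(20):
        w = inner_step(w, xs, ys, 0.1)
    print("synthetic point:", xs, ys, "real-data loss:", outer_loss(w, real_data))
```

In this sketch, setting `window` equal to `n_steps` recovers full backpropagation through the unrolled loop, while a smaller `window` bounds the cost of each outer update regardless of how long the inner loop runs. Randomizing `n` is what lets different outer updates see different segments of the training trajectory.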