Dataset distillation, a training-aware data compression technique, has recently attracted increasing attention as an effective tool for mitigating the costs of optimization and data storage. However, progress remains largely empirical: the mechanisms by which task-relevant information is extracted from the training process and efficiently encoded into synthetic data points remain elusive. In this paper, we theoretically analyze practical dataset distillation algorithms applied to the gradient-based training of two-layer neural networks of width $L$. By focusing on a non-linear task structure called the multi-index model, we prove that the low-dimensional structure of the problem is efficiently encoded into the resulting distilled data. This dataset reproduces a model with high generalization ability at a memory complexity of $\tilde{\Theta}(r^2 d + L)$, where $d$ and $r$ are the input and intrinsic dimensions of the task, respectively. To the best of our knowledge, this is one of the first theoretical works that incorporate a specific task structure, leverage its intrinsic dimensionality to quantify the compression rate, and study dataset distillation implemented solely via gradient-based algorithms.
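For concreteness, the following is a minimal sketch of the presumed setup in standard notation; these are the conventional forms of a multi-index model and a two-layer network, assumed here for illustration, and the paper's exact definitions may differ. The target depends on the $d$-dimensional input only through an $r$-dimensional projection, and the learner is a width-$L$ two-layer network.

% Assumed (standard) forms, not quoted from the paper itself:
% multi-index target: the label depends on x only through U^T x.
\[
  y = g\bigl(U^\top x\bigr), \qquad U \in \mathbb{R}^{d \times r}, \quad x \in \mathbb{R}^{d}, \quad r \ll d,
\]
% width-L two-layer network trained by a gradient-based algorithm:
\[
  f(x; a, W) = \sum_{j=1}^{L} a_j \,\sigma\bigl(\langle w_j, x \rangle\bigr), \qquad a_j \in \mathbb{R}, \quad w_j \in \mathbb{R}^{d}.
\]

Under this reading, the $r^2 d$ term of the $\tilde{\Theta}(r^2 d + L)$ budget is consistent with storing on the order of $r^2$ synthetic points in $\mathbb{R}^d$, while the $L$ term accounts for per-neuron information; this decomposition is an illustrative guess, not a claim made in the abstract.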