Dataset distillation has emerged as a strategy to overcome the hurdles associated with large datasets by learning a compact set of synthetic data that retains essential information from the original dataset. While distilled data can be used to train high-performing models, little is understood about how the information is stored. In this study, we posit and answer three questions about the behavior, representativeness, and point-wise information content of distilled data. We reveal that distilled data cannot serve as a substitute for real data during training outside the standard evaluation setting for dataset distillation. Additionally, the distillation process retains high task performance by compressing information related to the early training dynamics of real models. Finally, we provide a framework for interpreting distilled data and reveal that individual distilled data points contain meaningful semantic information. This investigation sheds light on the intricate nature of distilled data, providing a better understanding of how they can be effectively utilized.