Dataset distillation is a practical machine learning technique that creates a small synthetic dataset from a larger existing one. However, existing distillation methods are primarily model-based: the synthetic dataset inherits the biases of the specific model used during distillation, which limits how well it generalizes to other models. To address this limitation, we propose a method we call "model pool": during dataset distillation, models are sampled from a diverse pool according to a specified probability distribution. We further combine the model pool with established knowledge distillation, applying knowledge distillation when the distilled dataset is tested. Experiments across a range of existing models at test time validate the model pool approach and show that it outperforms existing methods.
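The model-selection step can be illustrated with a minimal sketch. It assumes the pool is a mapping from architecture names to sampling probabilities and that one model is drawn per distillation step. The architecture names, the weights, and the helper functions `sample_model` and `distillation_schedule` are hypothetical and chosen only for illustration; they are not taken from the paper, and the actual update of the synthetic dataset is omitted.

```python
import random
from collections import Counter

# Hypothetical pool: architecture names paired with sampling probabilities.
# Names and weights are illustrative assumptions, not the paper's settings.
MODEL_POOL = {
    "ConvNet": 0.7,
    "ResNet18": 0.1,
    "VGG11": 0.1,
    "AlexNet": 0.1,
}


def sample_model(pool, rng):
    """Draw one architecture name according to the pool's probabilities."""
    names = list(pool)
    weights = [pool[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def distillation_schedule(pool, num_steps, seed=0):
    """Return the architecture chosen at each distillation step.

    In a real pipeline each step would instantiate the sampled model and
    use it to update the synthetic dataset; that update is omitted here.
    """
    rng = random.Random(seed)
    return [sample_model(pool, rng) for _ in range(num_steps)]


# Sample a schedule and count how often each architecture was selected.
schedule = distillation_schedule(MODEL_POOL, num_steps=1000)
counts = Counter(schedule)
```

Because each step may see a different architecture, no single model's biases dominate the synthetic data in this sketch; the skew of the distribution controls how strongly any one architecture is favored.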