Dataset distillation methods have achieved remarkable success in distilling a large dataset into a small set of representative samples. However, they are not designed to produce a distilled dataset that can be used effectively for self-supervised pre-training. To this end, we propose a novel problem: distilling an unlabeled dataset into a small set of synthetic samples for efficient self-supervised learning (SSL). We first prove that, in naive bilevel optimization, the gradient of an SSL objective with respect to the synthetic samples is \textit{biased} due to the randomness originating from data augmentations or masking. To address this issue, we propose to use as the inner objective the mean squared error (MSE) between a model's representations of the synthetic examples and their corresponding learnable target feature representations, which does not introduce any randomness. Our primary motivation is that the model obtained by the proposed inner optimization can mimic the \textit{self-supervised target model}. To achieve this, we use as the outer objective the MSE between the representations of the inner model and those of the self-supervised target model on the original full dataset. Lastly, assuming that the feature extractor is fixed, we optimize only a linear head on top of it, which reduces the computational cost and yields a closed-form solution for the head via kernel ridge regression. We empirically validate the effectiveness of our method on various applications involving transfer learning.
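The closed-form head and the two MSE objectives can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the random ReLU feature extractor, the linear kernel on its features, the ridge value, the array shapes, and the names `features`, `krr_predict`, and `outer_mse` are all assumptions chosen for illustration, and the "target model" representations are random stand-ins. The sketch only evaluates the outer loss; it does not perform the gradient update of the synthetic samples and their targets.

```python
import numpy as np


def features(x, w):
    """Fixed feature extractor (assumption: a single random ReLU layer)."""
    return np.maximum(x @ w, 0.0)


def krr_predict(f_syn, y_syn, f_query, ridge=1e-3):
    """Closed-form kernel ridge regression with a linear kernel on features.

    Solves the inner MSE problem for a linear head on fixed features and
    returns the head's predictions for the query features.
    """
    k_ss = f_syn @ f_syn.T                      # kernel among synthetic samples
    k_qs = f_query @ f_syn.T                    # kernel between queries and synthetic samples
    alpha = np.linalg.solve(k_ss + ridge * np.eye(len(f_syn)), y_syn)
    return k_qs @ alpha


def outer_mse(x_syn, y_syn, x_real, z_real, w, ridge=1e-3):
    """Outer objective: MSE between the inner model's outputs and the
    target representations z_real on the real data."""
    preds = krr_predict(features(x_syn, w), y_syn, features(x_real, w), ridge)
    return float(np.mean((preds - z_real) ** 2))


rng = np.random.default_rng(0)
w = rng.normal(size=(8, 32)) / np.sqrt(8)       # frozen extractor weights
x_real = rng.normal(size=(64, 8))               # stand-in for the unlabeled dataset
z_real = rng.normal(size=(64, 4))               # stand-in for target-model representations
x_syn = rng.normal(size=(10, 8))                # synthetic samples (learnable in the method)
y_syn = rng.normal(size=(10, 4))                # their target representations (learnable)

loss = outer_mse(x_syn, y_syn, x_real, z_real, w)
print("outer MSE:", loss)
```

Because neither objective involves augmentation or masking, evaluating `outer_mse` twice on the same inputs gives the same value, which is the property the abstract relies on to avoid a biased gradient. In a full pipeline one would differentiate this loss with respect to `x_syn` and `y_syn` using an automatic differentiation library.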