Dataset reduction (DR) seeks to select or distill samples from large datasets into smaller subsets while preserving performance on target tasks. Existing methods primarily focus on pruning or synthesizing data in the same format as the original dataset, typically input data and corresponding labels. However, in DR settings, we find it is possible to synthesize information beyond the data-label pair as an additional learning target to facilitate model training. In this paper, we introduce Dataset Reduction Using Privileged Information (DRUPI), which enriches DR by synthesizing privileged information alongside the reduced dataset. This privileged information can take the form of feature labels or attention labels, providing auxiliary supervision to improve model learning. Our findings reveal that effective feature labels must avoid being either overly discriminative or excessively diverse, with a moderate level proving optimal for improving the reduced dataset's efficacy. Extensive experiments on ImageNet, CIFAR-10/100, and Tiny ImageNet demonstrate that DRUPI integrates seamlessly with existing dataset reduction methods, offering significant performance gains.