Large-scale web-crawled datasets contain noise, bias, and irrelevant information, necessitating data selection techniques. Existing methods depend on hand-crafted heuristics or downstream datasets, or require expensive influence-based computations -- all of which limit scalability and introduce unwanted data dependencies. To address this, we introduce the Mimic Score, a simple, geometry-based data-quality metric that evaluates a sample's utility by measuring the alignment between its gradient and a target direction induced by a pre-trained reference model. The metric leverages readily available model weights, requires no validation datasets, and incurs minimal computational overhead. Building on this metric, we propose Grad-Mimic, a two-stage framework that re-weights samples online to accelerate training and aggregates sample utilities offline to construct effective data filters. Empirically, we show that using mimic scores to guide training improves data efficiency, accelerates convergence, yields consistent performance gains across six image datasets, and enhances CLIP models with 20.7% fewer training steps. Additionally, mimic score-based filters augment existing filtering techniques, enabling improved CLIP models trained with 4.7 million fewer samples.
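The alignment idea can be illustrated with a minimal sketch. This is not the paper's exact formulation; the function name, the use of cosine similarity, and the toy 2-D weights below are all illustrative assumptions. The core idea from the abstract is that a sample is scored by how well its (negative) gradient points toward a target direction induced by a pre-trained reference model.

```python
import numpy as np

def mimic_score(sample_grad, w_current, w_ref, eps=1e-12):
    """Illustrative mimic-score sketch (assumed form, not the paper's exact metric).

    Scores a sample by the cosine alignment between its descent direction
    (the negative gradient) and the direction from the current weights
    toward the reference model's weights. Higher means the sample's update
    moves the model toward the reference, suggesting higher utility.
    """
    target_dir = w_ref - w_current        # direction induced by the reference model
    descent = -sample_grad                # gradient descent moves along -gradient
    cos = descent @ target_dir / (
        np.linalg.norm(descent) * np.linalg.norm(target_dir) + eps
    )
    return cos

# Toy 2-D weight space: a "useful" sample's descent step points at w_ref,
# a "noisy" sample's step points away from it.
w_cur = np.array([0.0, 0.0])
w_ref = np.array([1.0, 1.0])
useful_grad = np.array([-1.0, -1.0])  # descent direction = [1, 1], toward w_ref
noisy_grad = np.array([1.0, 1.0])     # descent direction = [-1, -1], away from w_ref
print(mimic_score(useful_grad, w_cur, w_ref))  # close to 1.0
print(mimic_score(noisy_grad, w_cur, w_ref))   # close to -1.0
```

In the framework described above, such per-sample scores could then be used online as training weights or aggregated offline into a keep/discard filter.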