Mislabeled, duplicated, or biased data in real-world scenarios can prolong training and even hinder model convergence. Traditional solutions that prioritize either easy or hard samples lack the flexibility to handle all of these data problems at once. Recent work has proposed a more reasonable data selection principle that examines each sample's impact on the model's generalization loss. However, its practical adoption relies on less principled approximations and on additional holdout data. This work solves both problems by applying a lightweight Bayesian treatment and incorporating off-the-shelf zero-shot predictors built on large-scale pre-trained models. The resulting algorithm is efficient and easy to implement. We perform extensive empirical studies in the online batch selection scenario, on challenging benchmarks with considerable data noise and imbalance, and observe superior training efficiency over competitive baselines. Notably, on the challenging WebVision benchmark, our method achieves similar predictive performance with significantly fewer training iterations than leading data selection methods.
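To make the online batch selection setting concrete, the following is a minimal sketch, not the method proposed in this work. The abstract does not state the scoring rule, so the sketch assumes a simple one: each candidate is scored by the current model's loss minus the loss of a fixed reference predictor (standing in for a zero-shot predictor), and the top-scoring candidates are kept for the update. The function names, the score, and the toy probabilities are all illustrative assumptions, and the Bayesian treatment is not modeled.

```python
import numpy as np


def per_sample_cross_entropy(probs, labels):
    """Cross-entropy of each sample given predicted class probabilities."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    picked = probs[np.arange(len(labels)), labels]
    # Clip to avoid log(0) for confidently wrong predictions.
    return -np.log(np.clip(picked, 1e-12, 1.0))


def select_batch(model_probs, reference_probs, labels, k):
    """Return the indices of the k candidates with the largest score.

    Assumed score (hypothetical, not taken from the abstract):
    loss under the current model minus loss under a fixed reference
    predictor such as a zero-shot classifier.
    """
    model_loss = per_sample_cross_entropy(model_probs, labels)
    reference_loss = per_sample_cross_entropy(reference_probs, labels)
    score = model_loss - reference_loss
    order = np.argsort(-score, kind="stable")  # highest score first
    return order[:k]


# Toy candidate batch: four samples, two classes, all labeled class 0.
model_probs = [[0.9, 0.1], [0.4, 0.6], [0.5, 0.5], [0.2, 0.8]]
reference_probs = [[0.95, 0.05], [0.9, 0.1], [0.1, 0.9], [0.3, 0.7]]
labels = [0, 0, 0, 0]

selected = select_batch(model_probs, reference_probs, labels, k=2)
print(selected.tolist())
```

Under this assumed score, a sample the model already fits (index 0) and a sample whose label the reference predictor strongly disputes (index 2, a likely mislabel) both rank low, while samples that the model gets wrong but the reference finds plausible rank high.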