Mislabeled, duplicated, or biased data in real-world scenarios can prolong training and even hinder model convergence. Traditional solutions that prioritize either easy or hard samples lack the flexibility to handle this variety of data problems simultaneously. Recent work has proposed a more reasonable data selection principle that examines the data's impact on the model's generalization loss. However, its practical adoption relies on less principled approximations and on additional clean holdout data. This work addresses these problems by leveraging a lightweight Bayesian treatment and incorporating off-the-shelf zero-shot predictors built on large-scale pre-trained models. The resulting algorithm is efficient and easy to implement. We perform extensive empirical studies on challenging benchmarks with considerable data noise and imbalance in the online batch selection scenario, and observe superior training efficiency over competitive baselines. Notably, on the challenging WebVision benchmark, our method achieves similar predictive performance with significantly fewer training iterations than leading data selection methods.