Public datasets, crucial for modern machine learning and statistical inference, often contain low-quality or contaminated samples that can harm model performance. This creates a need for principled prefiltering procedures that a data provider can apply to protect the accuracy of a range of potential downstream statistical and learning procedures simultaneously. In this work, we formalize and analyze Learner-Agnostic Robust data Prefiltering (LARP), the problem of designing prefiltering procedures with guarantees on the worst-case loss over a pre-specified set of learners. We establish the feasibility of LARP in two theoretical settings by providing upper-bound guarantees on the worst-case loss. Our theoretical results indicate that protecting heterogeneous learner sets via LARP comes at the price of some performance loss compared to individual, learner-specific prefiltering; we call this gap the price of LARP. To assess this gap in performance, we empirically measure the price of LARP across image and tabular tasks. We further explore the potential benefits of LARP in terms of avoiding repeated data curation efforts, using a game-theoretic model in which the downstream learners can split the cost of a single prefiltering procedure.