Assessing the quality and impact of individual data points is critical for improving model performance and mitigating undesirable biases within the training dataset. Several data valuation algorithms have been proposed to quantify data quality; however, a systematic and standardized benchmarking system for data valuation is still lacking. In this paper, we introduce OpenDataVal, an easy-to-use and unified benchmark framework that empowers researchers and practitioners to apply and compare various data valuation algorithms. OpenDataVal provides an integrated environment that includes (i) a diverse collection of image, natural language, and tabular datasets, (ii) implementations of nine different state-of-the-art data valuation algorithms, and (iii) a prediction model API that can import any model from scikit-learn. Furthermore, we propose four downstream machine learning tasks for evaluating the quality of data values. We perform a benchmarking analysis using OpenDataVal, quantifying and comparing the efficacy of state-of-the-art data valuation approaches. We find that no single algorithm performs uniformly best across all tasks, so the algorithm should be chosen to suit the user's downstream task. OpenDataVal is publicly available at https://opendataval.github.io with comprehensive documentation. We also provide a leaderboard where researchers can evaluate the effectiveness of their own data valuation algorithms.
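To make the notion of a "data value" concrete, the following is a minimal sketch of leave-one-out valuation written directly against scikit-learn. It is not OpenDataVal's API, and the abstract does not say which nine algorithms the framework implements; leave-one-out is used here only as a simple, well-known baseline. The helper name `leave_one_out_values`, the synthetic dataset, and the label-flipping setup are all assumptions made for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def leave_one_out_values(model, X_train, y_train, X_val, y_val):
    """Hypothetical helper: value of point i = validation accuracy with all
    training points minus validation accuracy with point i removed."""
    full_score = clone(model).fit(X_train, y_train).score(X_val, y_val)
    values = np.zeros(len(X_train))
    for i in range(len(X_train)):
        keep = np.arange(len(X_train)) != i  # boolean mask dropping point i
        score_without_i = (
            clone(model).fit(X_train[keep], y_train[keep]).score(X_val, y_val)
        )
        values[i] = full_score - score_without_i
    return values


# Synthetic tabular data (an assumption; any labelled dataset would do).
X, y = make_classification(n_samples=120, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Illustrative corruption: flip the labels of a few training points.
rng = np.random.default_rng(0)
noisy_idx = rng.choice(len(X_train), size=6, replace=False)
y_train_noisy = y_train.copy()
y_train_noisy[noisy_idx] = 1 - y_train_noisy[noisy_idx]

values = leave_one_out_values(
    LogisticRegression(max_iter=1000), X_train, y_train_noisy, X_val, y_val
)

# Points sorted from lowest to highest value; low-valued points are the
# candidates a practitioner would inspect first.
lowest_first = np.argsort(values)
print(lowest_first[:10])
```

Because the model is passed in as an argument and copied with `clone`, any scikit-learn classifier could be substituted. Whether the flipped points actually receive the lowest values depends on the data and the model, so the sketch makes no claim about detection accuracy.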