Assessing the quality and impact of individual data points is critical for improving model performance and mitigating undesirable biases within the training dataset. Several data valuation algorithms have been proposed to quantify data quality; however, a systematic and standardized benchmarking system for data valuation is still lacking. In this paper, we introduce OpenDataVal, an easy-to-use and unified benchmark framework that enables researchers and practitioners to apply and compare various data valuation algorithms. OpenDataVal provides an integrated environment that includes (i) a diverse collection of image, natural language, and tabular datasets, (ii) implementations of eleven different state-of-the-art data valuation algorithms, and (iii) a prediction model API that can import any scikit-learn model. Furthermore, we propose four downstream machine learning tasks for evaluating the quality of data values. We perform a benchmarking analysis using OpenDataVal, quantifying and comparing the efficacy of state-of-the-art data valuation approaches. We find that no single algorithm performs uniformly best across all tasks, so the algorithm should be chosen to suit the user's downstream task. OpenDataVal is publicly available at https://opendataval.github.io with comprehensive documentation. We also provide a leaderboard where researchers can evaluate the effectiveness of their own data valuation algorithms.
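To make the notion of a data value concrete, the following is a minimal sketch of leave-one-out valuation, one of the simplest schemes in this family: the value of a training point is the change in validation accuracy when that point is removed. The sketch uses only scikit-learn and NumPy and does not use the OpenDataVal API; the helper name `leave_one_out_values`, the synthetic dataset, and the choice of logistic regression are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def leave_one_out_values(model_factory, X_train, y_train, X_val, y_val):
    """Hypothetical helper: value of point i is the validation accuracy of the
    model trained on all points minus the accuracy of the model trained without i."""
    full_score = model_factory().fit(X_train, y_train).score(X_val, y_val)
    values = np.empty(len(X_train))
    for i in range(len(X_train)):
        keep = np.arange(len(X_train)) != i  # boolean mask dropping point i
        score_without_i = (
            model_factory().fit(X_train[keep], y_train[keep]).score(X_val, y_val)
        )
        values[i] = full_score - score_without_i
    return values


# Synthetic data (an assumption; any tabular dataset would do).
X, y = make_classification(n_samples=260, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=60, random_state=0, stratify=y
)

values = leave_one_out_values(
    lambda: LogisticRegression(max_iter=1000), X_train, y_train, X_val, y_val
)

# Points with the lowest values are candidates for inspection or removal.
lowest_valued = np.argsort(values)[:5]
print("Indices of the five lowest-valued training points:", lowest_valued)
```

A positive value means the model performs better on the validation set with the point than without it; a negative value suggests the point may be noisy or mislabeled. Leave-one-out requires one retraining per data point, which is part of why cheaper or more robust valuation algorithms are of interest.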