Assessing the quality and impact of individual data points is critical for improving model performance and mitigating undesirable biases within the training dataset. Several data valuation algorithms have been proposed to quantify data quality; however, a systematic and standardized benchmarking system for data valuation is still lacking. In this paper, we introduce OpenDataVal, an easy-to-use and unified benchmark framework that empowers researchers and practitioners to apply and compare various data valuation algorithms. OpenDataVal provides an integrated environment that includes (i) a diverse collection of image, natural language, and tabular datasets, (ii) implementations of eleven different state-of-the-art data valuation algorithms, and (iii) a prediction model API that can import any model from scikit-learn. Furthermore, we propose four downstream machine learning tasks for evaluating the quality of data values. We perform a benchmarking analysis using OpenDataVal, quantifying and comparing the efficacy of state-of-the-art data valuation approaches. We find that no single algorithm performs uniformly best across all tasks, so the algorithm should be chosen to suit the user's downstream task. OpenDataVal is publicly available at https://opendataval.github.io with comprehensive documentation. We also provide a leaderboard where researchers can evaluate the effectiveness of their own data valuation algorithms.
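To make the notion of a "data value" concrete, the following is a minimal sketch of leave-one-out valuation, one of the simplest data valuation schemes: a training point's value is the change in validation accuracy when that point is removed. This is not OpenDataVal's API. The function `leave_one_out_values`, the stand-in classifier `NearestCentroid1D`, and the toy data are all hypothetical and exist only for illustration; the sketch assumes nothing about the model beyond a scikit-learn-style `fit`/`predict` interface.

```python
class NearestCentroid1D:
    """Hypothetical stand-in classifier with a scikit-learn-style fit/predict interface."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for x, label in zip(X, y):
            sums[label] = sums.get(label, 0.0) + x
            counts[label] = counts.get(label, 0) + 1
        # One centroid (mean feature value) per class label.
        self.centroids_ = {label: sums[label] / counts[label] for label in sums}
        return self

    def predict(self, X):
        # Assign each point to the class whose centroid is closest.
        return [min(self.centroids_, key=lambda c: abs(x - self.centroids_[c])) for x in X]


def accuracy(model, X, y):
    preds = model.predict(X)
    return sum(p == t for p, t in zip(preds, y)) / len(y)


def leave_one_out_values(make_model, X_train, y_train, X_val, y_val):
    """Value of point i = accuracy with all training data minus accuracy without point i.

    Assumes X_train and y_train are plain Python lists.
    """
    full = accuracy(make_model().fit(X_train, y_train), X_val, y_val)
    values = []
    for i in range(len(X_train)):
        X_rest = X_train[:i] + X_train[i + 1:]
        y_rest = y_train[:i] + y_train[i + 1:]
        without = accuracy(make_model().fit(X_rest, y_rest), X_val, y_val)
        values.append(full - without)
    return values


# Toy 1-D data (illustrative only): the last training point is deliberately mislabeled.
X_train = [0.0, 1.0, 2.0, 8.0, 9.0, 10.0, 9.5]
y_train = [0, 0, 0, 1, 1, 1, 0]
X_val = [1.0, 3.0, 4.0, 5.5, 7.0, 9.0]
y_val = [0, 0, 0, 1, 1, 1]

values = leave_one_out_values(NearestCentroid1D, X_train, y_train, X_val, y_val)
lowest = min(range(len(values)), key=values.__getitem__)
print("lowest-valued training index:", lowest)
```

On this toy data the mislabeled point receives a negative value, since removing it improves validation accuracy, while the remaining points receive zero. Because the function only calls `fit` and `predict`, a scikit-learn estimator could in principle be substituted for the stand-in classifier, provided the inputs are converted to the array shapes that estimator expects.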