In machine learning research, it is common to evaluate algorithms via their performance on standard benchmark datasets. While a growing body of work establishes guidelines for, and levels criticism at, data and benchmarking practices in machine learning, comparatively little attention has been paid to the data repositories where these datasets are stored, documented, and shared. In this paper, we analyze the landscape of these $\textit{benchmark data repositories}$ and the role they can play in improving benchmarking. This role includes addressing issues with both the datasets themselves (e.g., representational harms, construct validity) and the manner in which evaluation is carried out using such datasets (e.g., overemphasis on a few datasets and metrics, lack of reproducibility). To this end, we identify and discuss a set of considerations surrounding the design and use of benchmark data repositories, with a focus on improving benchmarking practices in machine learning.