In recent years, machine learning (ML) researchers have wrestled with defining and improving ML benchmarks and datasets. In parallel, some have trained a critical lens on the ethics of dataset creation and ML research. In this position paper, we highlight the entanglement of ethics with seemingly ``technical'' or ``scientific'' decisions about the design of ML benchmarks. Our starting point is the existence of multiple overlooked structural similarities between human intelligence benchmarks and ML benchmarks. Both types of benchmarks set standards for describing, evaluating, and comparing performance on tasks relevant to intelligence -- standards that many scholars of human intelligence have long recognized as value-laden. We use perspectives from feminist philosophy of science on IQ benchmarks and thick concepts in social science to argue that values need to be considered and documented when creating ML benchmarks. It is neither possible nor desirable to sidestep these value choices by creating value-neutral benchmarks. Finally, we outline practical recommendations for ML benchmark research ethics and ethics review.