The ethical ambiguity of AI data enrichment: Measuring gaps in research ethics norms and practices

The technical progression of artificial intelligence (AI) research has been built on breakthroughs in fields such as computer science, statistics, and mathematics. However, in the past decade AI researchers have increasingly looked to the social sciences, turning to human interactions to solve the challenges of model development. Paying crowdsourcing workers to generate or curate data, a practice known as data enrichment, has become indispensable for many areas of AI research, from natural language processing to reinforcement learning from human feedback (RLHF). Other fields that routinely interact with crowdsourcing workers, such as Psychology, have developed common governance requirements and norms to ensure research is undertaken ethically. This study explores how, and to what extent, comparable research ethics requirements and norms have developed for AI research and data enrichment. We focus on the approach taken by two leading conferences, ICLR and NeurIPS, and the journal publisher Springer. In a longitudinal study of accepted papers, and via a comparison with Psychology and CHI papers, this work finds that leading AI venues have begun to establish protocols for human data collection, but that these are inconsistently followed by authors. Whilst Psychology papers engaging with crowdsourcing workers frequently disclose ethics reviews, payment data, demographic data and other information, similar disclosures are far less common in leading AI venues despite similar guidance. The work concludes with hypotheses to explain these gaps in research ethics practices and a consideration of their implications.
