This work establishes a foundational framework for standardizing randomized controlled trials (RCTs) for AI evaluation, sometimes called human uplift studies. Drawing on experimental practices from disciplines with established RCT traditions, including software engineering, economics, clinical and health sciences, and psychology, we adopt the four-validity framework of Shadish et al. (2002) and extend it with a fifth principle on transparency, repeatability, and verification, adapted from the Transparency and Openness Promotion (TOP) Guidelines (Center for Open Science, 2025). We operationalize all five principles into 33 guidelines adapted for AI evaluation RCT contexts, each expressed as a requirement with a rationale, implementation instructions, and an evidence base. We position the principles and guidelines as serving three roles for AI evaluation RCTs: a design tool for planning studies, an evaluation rubric for assessing existing work, and a blueprint for standard setting as the field converges on norms. Our framework extends prior work in five ways: it centers evaluation on human performance rather than model output alone; formalizes causal inference through RCT methodology for AI contexts; integrates heterogeneity analysis and practical significance assessment; implements a graded transparency and repeatability framework; and addresses AI-specific challenges, including model versioning, human-AI interaction dynamics, contamination and spillover effects, and equitable impact assessment.