Human uplift studies, that is, studies that measure the effects of AI on human performance relative to a status quo, typically using randomized controlled trial (RCT) methodology, are increasingly used to inform deployment, governance, and safety decisions for frontier AI systems. While the methods underlying these studies are well established, their interaction with the distinctive properties of frontier AI systems remains underexamined, particularly when results inform high-stakes decisions. We present findings from interviews with 16 expert practitioners experienced in conducting human uplift studies in domains including biosecurity, cybersecurity, education, and labor. Across interviews, experts described a recurring tension between standard causal inference assumptions and the object of study itself. Rapidly evolving AI systems, shifting baselines, heterogeneous and changing user proficiency, and porous real-world settings strain the assumptions underlying internal, external, and construct validity, complicating the interpretation and appropriate use of uplift evidence. We synthesize these challenges across key stages of the human uplift research lifecycle and map them to practitioner-reported solutions, clarifying both the limits and the appropriate uses of evidence from human uplift studies in high-stakes decision-making.