(Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable

Large language models (LLMs) are increasingly used for tasks once reserved for trained researchers, including hypothesis generation, specification choice, and drafting conclusions. We argue that the reliability of AI-assisted research depends not only on model capability, but also on how cognitive labour is structured between humans and machines. We study this problem through Human-in-the-Loop Economic Research (HLER), a decision architecture based on pre-commitment, decision sequencing, accountability, and attention allocation. In a pre-specified 2×4 factorial experiment with 280 complete research runs across four datasets, an unconstrained multi-agent baseline produced critical failures in 72% of runs. Using the same underlying model, the same agent decomposition, and identical prompts for the shared reasoning agents, HLER reduced the failure rate to 16% by imposing three architectural commitments: LLMs reason but do not execute data work, data and estimation are handled deterministically, and three human decision gates bind the workflow. Fisher's exact test rejects equality of failure rates at p < 0.001. Reliability gains were largest on the least publicly represented dataset, a Qing-dynasty population register, consistent with a task-based production model with Fréchet-distributed output quality. An 80-run ablation suggests that deterministic computation and human gates contribute independently, with exploratory evidence of complementarity. We interpret HLER as a research harness rather than an autonomous AI scientist: it sharply reduces failures, makes residual weaknesses more visible, and prevents unreliable claims from being advanced as publication-ready outputs.
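To make the headline comparison concrete, the following is a minimal sketch of a two-sided Fisher's exact test on a 2×2 table of failures versus non-failures, written with the Python standard library only. The counts are illustrative assumptions, not figures reported in the abstract: the sketch assumes the 280 runs split evenly between the two architectures (140 each) and rounds 72% and 16% to 101 and 22 failures. The function name `fisher_exact_two_sided` is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x first-column counts in row 1,
        # with all table margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# ASSUMED counts (not from the source): 140 runs per arm, percentages rounded.
baseline_failures, baseline_ok = 101, 39   # about 72% failures
hler_failures, hler_ok = 22, 118           # about 16% failures

p_value = fisher_exact_two_sided(baseline_failures, baseline_ok,
                                 hler_failures, hler_ok)
print(f"two-sided p-value: {p_value:.3g}")
```

Under these assumed counts the p-value falls far below 0.001, in line with the rejection stated in the abstract; the exact value depends on the true per-arm counts, which the abstract does not give.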

