Orchestrating LLM Agents for Scientific Research: A Pilot Study of Multiple Choice Question (MCQ) Generation and Evaluation

Advances in large language models (LLMs) are rapidly transforming scientific work, yet empirical evidence on how these systems reshape research activities remains limited. We report a mixed-methods pilot evaluation of an AI-orchestrated research workflow in which a human researcher coordinated multiple LLM-based agents to perform data extraction, corpus construction, artifact generation, and artifact evaluation. Using the generation and assessment of multiple-choice questions (MCQs) as a testbed, we collected 1,071 SAT Math MCQs and employed LLM agents to extract questions from PDFs, retrieve and convert open textbooks into structured representations, align each MCQ with relevant textbook content, generate new MCQs under specified difficulty and cognitive levels, and evaluate both original and generated MCQs using a 24-criterion quality framework. Across all evaluations, average MCQ quality was high. However, criterion-level analysis and equivalence testing show that generated MCQs are not fully comparable to expert-vetted baseline questions. Strict similarity (24/24 criteria equivalent) was never achieved. Persistent gaps were concentrated in skill depth, cognitive engagement, difficulty calibration, and metadata alignment, while surface-level qualities, such as grammar fluency, clarity of options, and absence of duplicates, were consistently strong. Beyond MCQ outcomes, the study documents a labor shift: the researcher's work moved from "authoring items" toward specification, orchestration, verification, and governance. Formalizing constraints, designing rubrics, building validation loops, recovering from tool failures, and auditing provenance constituted the primary activities. We discuss implications for the future of scientific work, including emerging "AI research operations" skills required for AI-empowered research pipelines.
