Cognitively Diverse Multiple-Choice Question Generation: A Hybrid Multi-Agent Framework with Large Language Models

Recent advances in large language models (LLMs) have made automated multiple-choice question (MCQ) generation increasingly feasible; however, reliably producing items that satisfy controlled cognitive demands remains a challenge. To address this gap, we introduce ReQUESTA, a hybrid, multi-agent framework for generating cognitively diverse MCQs that systematically target text-based, inferential, and main idea comprehension. ReQUESTA decomposes MCQ authoring into specialized subtasks and coordinates LLM-powered agents with rule-based components to support planning, controlled generation, iterative evaluation, and post-processing. We evaluated the framework in a large-scale reading comprehension study using academic expository texts, comparing ReQUESTA-generated MCQs with those produced by a single-pass GPT-5 zero-shot baseline. Psychometric analyses of learner responses assessed item difficulty and discrimination, while expert raters evaluated question quality across multiple dimensions, including topic relevance and distractor quality. Results showed that ReQUESTA-generated items were consistently more challenging, more discriminative, and more strongly aligned with overall reading comprehension performance. Expert evaluations further indicated stronger alignment with central concepts and superior distractor linguistic consistency and semantic plausibility, particularly for inferential questions. These findings demonstrate that hybrid, agentic orchestration can systematically improve the reliability and controllability of LLM-based generation, highlighting workflow design as a key lever for structured artifact generation beyond single-pass prompting.
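The abstract names item difficulty and discrimination but does not say how they were computed. As a minimal sketch, assuming classical test theory indices (proportion correct for difficulty, corrected item-total point-biserial correlation for discrimination), the two quantities could be obtained from a binary response matrix as shown here. The function names and the toy `responses` data are hypothetical and are not taken from the study.

```python
from statistics import mean, pstdev

def item_difficulty(responses, item):
    """Proportion of learners who answered `item` correctly.

    Under this convention a LOWER value means a HARDER item.
    """
    return mean(row[item] for row in responses)

def item_discrimination(responses, item):
    """Corrected item-total (point-biserial) correlation for `item`.

    The item's 0/1 scores are correlated with each learner's total on the
    remaining items, so the item does not inflate its own correlation.
    """
    scores = [row[item] for row in responses]
    rest = [sum(row) - row[item] for row in responses]
    sd_item, sd_rest = pstdev(scores), pstdev(rest)
    if sd_item == 0 or sd_rest == 0:
        return float("nan")  # undefined when either variable is constant
    m_item, m_rest = mean(scores), mean(rest)
    cov = mean((x - m_item) * (y - m_rest) for x, y in zip(scores, rest))
    return cov / (sd_item * sd_rest)

# Hypothetical toy data: rows are learners, columns are items (1 = correct).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]

for i in range(3):
    p = item_difficulty(responses, i)
    r = item_discrimination(responses, i)
    print(f"item {i}: difficulty={p:.2f}, discrimination={r:.2f}")
```

Under these assumptions, a more challenging item set has lower proportion-correct values, and a more discriminative one has higher item-rest correlations. An item response theory model would give different but related estimates.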
