Large Language Model (LLM) agents have shown promise in automating Instructional Systems Design (ISD), a systematic approach to developing educational programs. However, evaluating these agents remains challenging due to the lack of standardized benchmarks and the risk of LLM-as-judge bias. We present ISD-Agent-Bench, a comprehensive benchmark of 25,795 scenarios generated via a Context Matrix framework that crosses 51 contextual variables spanning 5 categories with 33 ISD sub-steps derived from the ADDIE model. To ensure evaluation reliability, we employ a multi-judge protocol with diverse LLMs from different providers, achieving high inter-judge reliability. We compare existing ISD agents with novel agents grounded in classical ISD theories such as ADDIE, Dick \& Carey, and Rapid Prototyping ISD. Experiments on 1,017 test scenarios demonstrate that integrating classical ISD frameworks with modern ReAct-style reasoning achieves the highest performance, outperforming both pure theory-based agents and technique-only approaches. Further analysis reveals that theoretical quality correlates strongly with benchmark performance, with theory-based agents showing significant advantages in problem-centered design and objective-assessment alignment. Our work provides a foundation for systematic LLM-based ISD research.