From Untamed Black Box to Interpretable Pedagogical Orchestration: The Ensemble of Specialized LLMs Architecture for Adaptive Tutoring

From arXiv. Accepted as a full paper at the 27th International Conference on Artificial Intelligence in Education (AIED 2026). 15 pages, 4 figures, 4 tables.

Monolithic Large Language Models (LLMs) used in educational dialogue often behave as "black boxes": their pedagogical decisions are implicit and difficult to audit, and they frequently violate instructional constraints by providing answers too early. We introduce the Ensemble of Specialized LLMs (ES-LLMs) architecture, which separates decision-making from wording. Pedagogical actions are selected by a deterministic, rule-based orchestrator that coordinates specialized agents covering tutoring, assessment, feedback, scaffolding, motivation and ethics, guided by an interpretable Bayesian Knowledge Tracing (BKT) student model. An LLM renderer then surface-realizes the chosen action in natural language. This design emphasizes reliability and controllability: constraints such as "attempt-before-hint" and hint caps are enforced as explicit rules, and the system logs per-turn agent traces and constraint checks. Validation of pedagogical quality by human expert reviewers (N=6) and a multi-LLM-as-Judge panel (six state-of-the-art models) showed that ES-LLMs were preferred in 91.7% and 79.2% of cases, respectively. The architecture significantly outperformed monolithic baselines across all seven evaluation dimensions, particularly in Scaffolding & Guidance and in Trust & Explainability. Furthermore, a Monte Carlo simulation (N=2,400) exposed a "Mastery Gain Paradox," in which monolithic tutors inflated short-term performance through over-assistance. In contrast, ES-LLMs achieved 100% adherence to pedagogical constraints (e.g., attempt-before-hint) and a 3.3x increase in hint efficiency. Operationally, ES-LLMs reduced costs by 54% and latency by 22% by using stateless prompts. We conclude that structural decoupling is essential for transforming stochastic models into trustworthy, verifiable and resource-efficient pedagogical agents.
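To make the separation of decision-making from wording concrete, the following is a minimal sketch of how a BKT student model and a deterministic rule cascade might fit together. It is not the authors' implementation: the names `BKTParams`, `TurnState`, `bkt_update` and `select_action`, the action labels, the parameter values, the hint cap of 3 and the mastery threshold of 0.95 are all assumptions chosen for illustration. Only the textbook BKT update and the two constraints named in the abstract (attempt-before-hint and a hint cap) are standard or taken from the source.

```python
from dataclasses import dataclass, field

# NOTE: every name and numeric value here is an illustrative assumption,
# not a detail taken from the paper.


@dataclass
class BKTParams:
    p_learn: float = 0.15  # P(T): chance of learning at each opportunity
    p_slip: float = 0.10   # P(S): skill mastered, but answer is incorrect
    p_guess: float = 0.20  # P(G): skill not mastered, but answer is correct


def bkt_update(p_mastery: float, correct: bool, p: BKTParams) -> float:
    """Textbook BKT step: Bayesian posterior, then the learning transition."""
    if correct:
        num = p_mastery * (1 - p.p_slip)
        den = num + (1 - p_mastery) * p.p_guess
    else:
        num = p_mastery * p.p_slip
        den = num + (1 - p_mastery) * (1 - p.p_guess)
    posterior = num / den
    return posterior + (1 - posterior) * p.p_learn


@dataclass
class TurnState:
    p_mastery: float
    attempts: int = 0
    hints_given: int = 0
    trace: list = field(default_factory=list)  # per-turn audit log


MAX_HINTS = 3             # hypothetical hint cap
MASTERY_THRESHOLD = 0.95  # hypothetical mastery cut-off


def select_action(state: TurnState, hint_requested: bool) -> str:
    """Deterministic rule cascade; an LLM renderer would only word the result."""
    if state.p_mastery >= MASTERY_THRESHOLD:
        action = "advance"
    elif hint_requested and state.attempts == 0:
        action = "prompt_attempt"   # attempt-before-hint rule
    elif hint_requested and state.hints_given >= MAX_HINTS:
        action = "encourage_retry"  # hint cap reached
    elif hint_requested:
        state.hints_given += 1
        action = "give_hint"
    else:
        action = "assess"
    # Record the decision and the state it was based on for later auditing.
    state.trace.append({
        "action": action,
        "p_mastery": state.p_mastery,
        "attempts": state.attempts,
        "hints_given": state.hints_given,
    })
    return action


if __name__ == "__main__":
    state = TurnState(p_mastery=0.2)
    print(select_action(state, hint_requested=True))  # no attempt made yet
    state.attempts += 1
    state.p_mastery = bkt_update(state.p_mastery, correct=False, p=BKTParams())
    print(select_action(state, hint_requested=True))
    print(state.trace)
```

Because the action is chosen before any text is generated, a constraint such as attempt-before-hint holds by construction in this sketch, and the `trace` list gives the kind of per-turn record the abstract describes.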
