Large language models are now widely used for everyday learning, but these interactions are typically unstructured chats that follow no curriculum. Unlike formal online learning systems, they carry no prior record of the student, so any estimate of what the student already knows must be inferred from the dialogue itself. We show that this gap is not closed by scaling models alone. Frontier and education-tuned LLMs perform poorly when asked to tutor a student over an extended session, because doing so requires three things at once: the tutor must sequence a curriculum, conduct Socratic dialogue, and infer the student's knowledge state from that dialogue. We propose separating these responsibilities. Given a student query, our system constructs a prerequisite knowledge graph in which subtopics are nodes and dependencies are edges, and frames tutoring as deciding which node to teach next and how many dialogue turns to spend on it before moving on. A lightweight PPO policy handles this sequencing decision, while an LLM conducts the Socratic exchange at the chosen node and returns a signal of student progress. Across held-out STEM and non-STEM topics, our PPO-paired tutor outperforms heuristic baselines, frontier general-purpose models, and a model specialised for Socratic dialogue, both in the rate at which students reach full curriculum mastery and in the number of turns required to reach it. Explicit curriculum structure delivers gains that scaling the underlying model does not.
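The separation of responsibilities described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `PrerequisiteGraph`, `choose_action`, and `run_session`, the 0.8 mastery threshold, and the additive progress update are hypothetical, and the hand-written rule in `choose_action` is only a stand-in for the learned PPO policy, not the method itself. The `tutor_turn` callable stands in for the LLM's Socratic exchange and its progress signal.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class PrerequisiteGraph:
    """Subtopics are nodes; prereqs[node] lists the nodes it depends on.

    Assumption: every subtopic appears as a key in `prereqs`.
    """
    prereqs: dict[str, set[str]]
    mastery: dict[str, float] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # With no prior record of the student, every node starts at 0.
        for node in self.prereqs:
            self.mastery.setdefault(node, 0.0)

    def eligible(self, threshold: float = 0.8) -> list[str]:
        """Unmastered nodes whose prerequisites are all mastered."""
        return [
            node for node, pre in self.prereqs.items()
            if self.mastery[node] < threshold
            and all(self.mastery[p] >= threshold for p in pre)
        ]

    def all_mastered(self, threshold: float = 0.8) -> bool:
        return all(m >= threshold for m in self.mastery.values())


def choose_action(graph: PrerequisiteGraph, max_turns: int = 4,
                  threshold: float = 0.8) -> Optional[tuple[str, int]]:
    """Heuristic stand-in for the sequencing policy (NOT PPO).

    Returns (node to teach next, dialogue-turn budget), or None when
    no node is eligible.
    """
    candidates = graph.eligible(threshold)
    if not candidates:
        return None
    node = min(candidates, key=lambda n: graph.mastery[n])
    # Spend more turns on nodes that are further from the threshold.
    gap = threshold - graph.mastery[node]
    turns = max(1, min(max_turns, round(gap / threshold * max_turns)))
    return node, turns


def run_session(graph: PrerequisiteGraph,
                tutor_turn: Callable[[str], float],
                max_total_turns: int = 50) -> tuple[int, bool]:
    """Alternate sequencing decisions with dialogue turns at the chosen node.

    `tutor_turn(node)` is a hypothetical hook for the LLM exchange; it
    returns a progress signal that is added to the node's mastery estimate.
    """
    used = 0
    while used < max_total_turns:
        action = choose_action(graph)
        if action is None:
            break
        node, budget = action
        for _ in range(budget):
            if used >= max_total_turns:
                break
            graph.mastery[node] = min(1.0, graph.mastery[node] + tutor_turn(node))
            used += 1
    return used, graph.all_mastered()


# Usage: a three-node chain and a simulated tutor with fixed progress per turn.
graph = PrerequisiteGraph(
    {"fractions": set(), "ratios": {"fractions"}, "proportions": {"ratios"}}
)
turns_used, reached_mastery = run_session(graph, tutor_turn=lambda node: 0.3)
```

The sketch tracks the two quantities the evaluation reports, whether full curriculum mastery is reached and how many turns it takes. A learned policy would replace `choose_action` while the graph and the dialogue hook stay unchanged.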