A Causal Framework for Estimating Heterogeneous Effects of On-Demand Tutoring

This paper introduces a scalable causal inference framework for estimating the immediate, session-level effects of on-demand human tutoring embedded within adaptive learning systems. Because students seek assistance at moments of difficulty, conventional evaluation is confounded by self-selection and time-varying knowledge states. We address these challenges by combining principled analytic sample construction with Deep Knowledge Tracing (DKT) to estimate latent mastery, followed by doubly robust estimation using Causal Forests. Applying this framework to over 5,000 middle-school mathematics tutoring sessions, we find that requesting human tutoring increases next-problem correctness by approximately 4 percentage points and accuracy on the subsequent skill encountered by approximately 3 percentage points, suggesting that tutoring effects transfer proximally across knowledge components. These effects are robust to alternative model specifications and to potential unmeasured confounders. Notably, the effects exhibit significant heterogeneity across sessions and students, with session-level effect estimates ranging from $-20.25$ to $+19.91$ percentage points. Our follow-up analyses suggest that typical behavioral indicators, such as student talk time, do not consistently correlate with high-impact sessions. Furthermore, treatment effects are larger for students with lower prior mastery and slightly smaller for low-SES students. This framework offers a rigorous, practical template for the evaluation and continuous improvement of on-demand human tutoring, with direct applications for emerging AI tutoring systems.
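To make the doubly robust idea concrete, the following is a minimal sketch on simulated data, not the pipeline used in this paper. It assumes a single observed covariate standing in for DKT-estimated mastery, replaces the Causal Forest with simple logistic and linear nuisance models, and estimates only an average effect through augmented inverse probability weighting (AIPW). All function names, the simulated data, and the 0.04 effect size are hypothetical choices made for illustration.

```python
import numpy as np

def _with_intercept(X):
    return np.column_stack([np.ones(len(X)), X])

def fit_logistic(X, t, n_iter=25):
    """Propensity model: logistic regression fitted by Newton's method."""
    Xb = _with_intercept(X)
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))
        w = p * (1.0 - p)
        grad = Xb.T @ (t - p)
        hess = (Xb * w[:, None]).T @ Xb + 1e-8 * np.eye(Xb.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def predict_logistic(beta, X):
    return 1.0 / (1.0 + np.exp(-_with_intercept(X) @ beta))

def fit_linear(X, y):
    """Outcome model: ordinary least squares with an intercept."""
    coef, *_ = np.linalg.lstsq(_with_intercept(X), y, rcond=None)
    return coef

def predict_linear(coef, X):
    return _with_intercept(X) @ coef

def aipw_scores(X, t, y, clip=0.01):
    """Per-observation AIPW scores; their mean estimates the average effect."""
    e = np.clip(predict_logistic(fit_logistic(X, t), X), clip, 1.0 - clip)
    mu1 = predict_linear(fit_linear(X[t == 1], y[t == 1]), X)
    mu0 = predict_linear(fit_linear(X[t == 0], y[t == 0]), X)
    return mu1 - mu0 + t * (y - mu1) / e - (1 - t) * (y - mu0) / (1.0 - e)

def simulate(n=20000, seed=0):
    """Hypothetical data: lower mastery makes a tutoring request more likely."""
    rng = np.random.default_rng(seed)
    mastery = rng.normal(size=n)
    p_request = 1.0 / (1.0 + np.exp(0.8 * mastery))  # self-selection
    t = rng.binomial(1, p_request)
    # Assumed true effect of 0.04; the outcome also rises with mastery.
    y = 0.6 + 0.15 * mastery + 0.04 * t + rng.normal(scale=0.3, size=n)
    return mastery.reshape(-1, 1), t, y

if __name__ == "__main__":
    X, t, y = simulate()
    naive = y[t == 1].mean() - y[t == 0].mean()
    ate = aipw_scores(X, t, y).mean()
    print(f"naive difference: {naive:.3f}")
    print(f"AIPW estimate:    {ate:.3f}")
```

In this simulation the naive treated-versus-untreated comparison is biased downward, because lower-mastery students request help more often, while the AIPW estimate adjusts for that selection using the observed covariate. Estimating heterogeneous, session-level effects as described above would require a method such as a Causal Forest in place of the simple average taken here.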
