A Causal Framework for Estimating Heterogeneous Effects of On-Demand Tutoring

This paper introduces a scalable causal inference framework for estimating the immediate, session-level effects of on-demand human tutoring embedded within adaptive learning systems. Because students seek assistance at moments of difficulty, conventional evaluation is confounded by self-selection and time-varying knowledge states. We address these challenges by combining principled analytic sample construction with Deep Knowledge Tracing (DKT) to estimate latent mastery, followed by doubly robust estimation using Causal Forests. Applying this framework to over 5,000 middle-school mathematics tutoring sessions, we find that requesting human tutoring increases next-problem correctness by approximately 4 percentage points (pp) and accuracy on the subsequent skill encountered by approximately 3 pp, suggesting that tutoring effects transfer proximally across knowledge components. These effects are robust to alternative model specifications and to potential unmeasured confounding. Notably, the effects exhibit significant heterogeneity across sessions and students, with session-level effect estimates ranging from $-20.25$ pp to $+19.91$ pp. Our follow-up analyses suggest that typical behavioral indicators, such as student talk time, are not consistently associated with high-impact sessions. Furthermore, treatment effects are larger for students with lower prior mastery and slightly smaller for low-SES students. This framework offers a rigorous, practical template for the evaluation and continuous improvement of on-demand human tutoring, with direct applications for emerging AI tutoring systems.

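To make the doubly robust idea in the abstract concrete, the following is a minimal sketch of an augmented inverse-probability-weighted (AIPW) estimator of an average treatment effect. It is not the Causal Forest procedure the paper uses: it swaps in simple parametric nuisance models (logistic regression for the propensity to request tutoring, linear regression for the outcome) and runs on simulated data. All names (`simulate`, `aipw_ate`, the `mastery` covariate standing in for a DKT estimate) and all numeric settings, including the simulated effect of 0.04, are illustrative assumptions and are not taken from the paper's data or code.

```python
import numpy as np


def _with_intercept(X):
    """Prepend a column of ones so the models include an intercept."""
    return np.column_stack([np.ones(len(X)), X])


def fit_logistic(X, t, n_iter=25):
    """Fit a logistic regression with Newton's method; returns coefficients."""
    Xb = _with_intercept(X)
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))
        grad = Xb.T @ (t - p)
        # Small ridge term keeps the Hessian invertible.
        hess = (Xb * (p * (1.0 - p))[:, None]).T @ Xb + 1e-8 * np.eye(Xb.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def fit_linear(X, y):
    """Ordinary least squares; returns coefficients."""
    coef, *_ = np.linalg.lstsq(_with_intercept(X), y, rcond=None)
    return coef


def aipw_ate(X, t, y, clip=0.01):
    """AIPW estimate of the average treatment effect and its standard error."""
    # Propensity model: probability of treatment given covariates.
    e = 1.0 / (1.0 + np.exp(-_with_intercept(X) @ fit_logistic(X, t)))
    e = np.clip(e, clip, 1.0 - clip)
    # Outcome models fitted separately on treated and control units.
    mu1 = _with_intercept(X) @ fit_linear(X[t == 1], y[t == 1])
    mu0 = _with_intercept(X) @ fit_linear(X[t == 0], y[t == 0])
    # Doubly robust score: outcome-model contrast plus weighted residuals.
    scores = mu1 - mu0 + t * (y - mu1) / e - (1.0 - t) * (y - mu0) / (1.0 - e)
    return scores.mean(), scores.std(ddof=1) / np.sqrt(len(scores))


def simulate(n=20000, seed=0):
    """Simulated sessions (hypothetical): low mastery raises tutoring uptake."""
    rng = np.random.default_rng(seed)
    mastery = rng.uniform(0.0, 1.0, n)  # stand-in for a DKT mastery estimate
    p_treat = 1.0 / (1.0 + np.exp(-(1.0 - 3.0 * mastery)))
    t = (rng.uniform(size=n) < p_treat).astype(float)
    # Assumed true effect of tutoring on the outcome: 0.04.
    y = 0.3 + 0.5 * mastery + 0.04 * t + rng.normal(0.0, 0.1, n)
    return mastery[:, None], t, y


if __name__ == "__main__":
    X, t, y = simulate()
    ate, se = aipw_ate(X, t, y)
    naive = y[t == 1].mean() - y[t == 0].mean()
    print(f"naive difference: {naive:.4f}")
    print(f"AIPW estimate:    {ate:.4f} (SE {se:.4f})")
```

In this simulation the naive treated-minus-control difference is pulled downward because lower-mastery students are more likely to request tutoring, while the AIPW estimate adjusts for that selection. A Causal Forest would replace the single average with covariate-specific effect estimates, which this sketch does not attempt.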