Socratic questioning is an effective teaching strategy, encouraging critical thinking and problem-solving. The conversational capabilities of large language models (LLMs) show great potential for providing scalable, real-time student guidance. However, current LLMs often give away solutions directly, making them ineffective instructors. We tackle this issue in the code debugging domain with TreeInstruct, an Instructor agent guided by a novel state space-based planning algorithm. TreeInstruct asks probing questions to help students independently identify and resolve errors. It estimates a student's conceptual and syntactical knowledge to dynamically construct a question tree based on their responses and current knowledge state, effectively addressing both independent and dependent mistakes concurrently in a multi-turn interaction setting. In addition to using an existing single-bug debugging benchmark, we construct a more challenging multi-bug dataset of 150 coding problems, incorrect solutions, and bug fixes -- all carefully constructed and annotated by experts. Extensive evaluation shows TreeInstruct's state-of-the-art performance on both datasets, proving it to be a more effective instructor than baselines. Furthermore, a real-world case study with five students of varying skill levels further demonstrates TreeInstruct's ability to guide students to debug their code efficiently with minimal turns and highly Socratic questioning.