Large language models (LLMs) are increasingly being explored in higher education, yet their effectiveness as teaching agents remains underexamined. In this paper, we present the development of GuideLM, a fine-tuned LLM designed for programming education. GuideLM has been integrated into the Debugging C Compiler (DCC), an educational C compiler that leverages LLMs to generate pedagogically sound error explanations. Previously, DCC relied on off-the-shelf OpenAI models, which, while accurate, often over-assisted students by directly providing solutions despite contrary prompting. To address this, we employed supervised fine-tuning (SFT) on a dataset of 528 student-question/teacher-answer pairs, creating two models: GuideLM and GuideLM-mini, fine-tuned from GPT-4o and GPT-4o-mini, respectively. We conducted an expert analysis of 400 responses per model, comparing their pedagogical effectiveness against the base OpenAI models. Our evaluation, grounded in constructivism and cognitive load theory, assessed factors such as conceptual scaffolding, clarity, and Socratic guidance. Results indicate that GuideLM and GuideLM-mini improve pedagogical performance, with an 8% increase in Socratic guidance and a 58% improvement in economy of words compared to GPT-4o. However, this refinement comes at the cost of a slight reduction in general accuracy. While further work is needed, our findings suggest that fine-tuning LLMs with targeted datasets is a promising approach for developing models better suited to educational contexts.
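The SFT setup described above pairs each student question with a teacher-authored answer. As a minimal sketch of how such pairs are typically prepared for OpenAI's chat fine-tuning pipeline, the snippet below formats one pair as a JSONL record; the system prompt and the example pair are illustrative assumptions, not taken from the paper's 528-pair dataset.

```python
import json

# Hypothetical tutoring-style system prompt (the paper's actual prompt is not shown).
SYSTEM_PROMPT = (
    "You are a programming tutor. Guide the student toward the answer with "
    "questions and conceptual hints; do not provide the full solution."
)


def to_sft_record(question: str, answer: str) -> str:
    """Serialize one student-question/teacher-answer pair as a JSONL line
    in the OpenAI chat fine-tuning schema (a `messages` array)."""
    record = {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }
    return json.dumps(record, ensure_ascii=False)


# Illustrative pair in the guiding, non-solution-revealing style the paper targets:
line = to_sft_record(
    "Why does my C program segfault when I scanf into an uninitialized pointer?",
    "What does your pointer point to before the call? Think about where scanf "
    "writes its result when the pointer has no allocated target.",
)
```

One line per training pair is written to a `.jsonl` file, which is then uploaded to the fine-tuning API; keeping the same system prompt at inference time helps the tuned model stay in the guiding register it was trained on.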