The sudden emergence of large language models (LLMs) such as ChatGPT has had a disruptive impact throughout the computing education community. LLMs have been shown to excel at producing correct code for CS1 and CS2 problems, and can even act as friendly assistants to students learning to code. Recent work reports that LLMs achieve unequivocally superior results in explaining and resolving compiler error messages -- for decades, one of the most frustrating parts of learning to code. However, LLM-generated error message explanations have so far been assessed only by expert programmers in artificial conditions. This work sought to understand how novice programmers resolve programming error messages (PEMs) in a more realistic scenario. We ran a within-subjects study with $n = 106$ participants in which students were tasked with fixing six buggy C programs. For each program, participants were randomly assigned to fix the problem using either a stock compiler error message, an expert-handwritten error message, or an error message explanation generated by GPT-4. Despite promising evidence on synthetic benchmarks, we found that GPT-4-generated error messages outperformed conventional compiler error messages on only 1 of the 6 tasks, as measured by students' time to fix each problem. Handwritten explanations still outperform both LLM-generated and conventional error messages, on objective and subjective measures alike.