Background and Context: Over the past year, large language models (LLMs) have taken the world by storm. In computing education, as in other walks of life, many opportunities and threats have emerged as a consequence.

Objectives: In this article, we explore such opportunities and threats in a specific area: responding to student programmers' help requests. More specifically, we assess how well LLMs identify issues in the problematic code that students request help with.

Method: We collected a sample of help requests and code from an online programming course. We then prompted two different LLMs (OpenAI Codex and GPT-3.5) to identify and explain the issues in the students' code and assessed the LLM-generated answers both quantitatively and qualitatively.

Findings: GPT-3.5 outperforms Codex in most respects. Both LLMs frequently find at least one actual issue in each student program (GPT-3.5 in 90% of the cases). Neither LLM excels at finding all the issues (GPT-3.5 finds all of them 57% of the time). False positives are common (40% chance for GPT-3.5). The advice that the LLMs provide on the issues is often sensible. The LLMs perform better on issues involving program logic than on issues involving output formatting. Model solutions are frequently provided even when the LLM is prompted not to provide them. LLM responses to prompts in a non-English language are only slightly worse than responses to English prompts.

Implications: Our results continue to highlight the utility of LLMs in programming education. At the same time, the results highlight the unreliability of LLMs: LLMs make some of the same mistakes that students do, perhaps especially when formatting output as required by automated assessment systems. Our study informs teachers interested in using LLMs as well as future efforts to customize LLMs for the needs of programming education.