Decoding Logic Errors: A Comparative Study on Bug Detection by Students and Large Language Models

Identifying and resolving logic errors can be one of the most frustrating challenges for novice programmers. Unlike syntax errors, for which a compiler or interpreter can issue a message, logic errors can be subtle. Under certain conditions, buggy code may even exhibit correct behavior; in other cases, the issue may lie in how a problem statement has been interpreted. Such errors can be hard to spot when reading the code, and they can also at times be missed by automated tests. There is great educational potential in automatically detecting logic errors, especially when paired with suitable feedback for novices. Large language models (LLMs) have recently demonstrated surprising performance on a range of computing tasks, including generating and explaining code. These capabilities are closely linked to code syntax, which aligns with the next-token prediction behavior of LLMs. Logic errors, on the other hand, relate to the runtime behavior of code and thus may not be as well suited to analysis by LLMs. To explore this, we investigate the performance of two popular LLMs, GPT-3 and GPT-4, at detecting logic errors and providing novice-friendly explanations of them. We compare LLM performance with that of a large cohort of introductory computing students $(n=964)$ solving the same error detection task. Through a mixed-methods analysis of student and model responses, we observe significant improvement in logic error identification between the previous and current generation of LLMs, and find that both LLM generations significantly outperform students. We outline how such models could be integrated into computing education tools, and discuss their potential for supporting students when learning programming.
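
To make concrete the remark that buggy code may behave correctly under some conditions, the following is a minimal, hypothetical sketch. It is not taken from the study's materials, and the function names `largest_buggy` and `largest_fixed` are invented for illustration. The buggy version returns the right answer whenever the input contains a positive value, so a test suite that uses only positive inputs would not expose the error.

```python
def largest_buggy(values):
    """Intended to return the largest element of a non-empty list.

    Hypothetical example of a logic error: the running maximum starts
    at 0, which silently assumes at least one element is positive.
    """
    largest = 0  # logic error: 0 is not guaranteed to be in the list
    for v in values:
        if v > largest:
            largest = v
    return largest


def largest_fixed(values):
    """Corrected version: start from the first element instead of 0."""
    largest = values[0]
    for v in values[1:]:
        if v > largest:
            largest = v
    return largest


if __name__ == "__main__":
    # With positive inputs, both versions agree and the bug stays hidden.
    print(largest_buggy([3, 7, 2]), largest_fixed([3, 7, 2]))
    # With only negative inputs, the buggy version returns a value
    # that is not in the list at all.
    print(largest_buggy([-3, -7, -2]), largest_fixed([-3, -7, -2]))
```

Both functions run without any compiler or interpreter message, which is what separates this kind of error from a syntax error: the fault only shows up on inputs the programmer did not consider.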
