Simulating Students' Java Programming Errors with Large Language Models

Understanding student errors is a cornerstone of programming education, yet obtaining a representative set of student errors for a newly designed task remains slow and costly, since authentic submissions accumulate only after extensive classroom deployment. This paper explores whether large language models (LLMs) can serve as scalable proxies for students by simulating realistic logical errors in code submissions. Using the CodeWorkout dataset of 74,000+ unique student Java submissions across 37 problems, we evaluate five LLMs under three mainstream prompting strategies: Input-Output (IO), Chain-of-Thought (CoT), and iterative Self-Refine. We assess performance along two key dimensions: diversity (the range of distinct error patterns) and alignment (how closely the generated errors match authentic student mistakes), and examine how both vary with the struggling level of each programming task. Our quantitative findings reveal that while all models generate diverse errors, their alignment with human submissions varies, with Claude Sonnet 4 achieving the most balanced performance. In addition, we conduct a blinded expert annotation study (N = 401) comparing synthetic and authentic errors. This qualitative analysis confirms that the generated errors are functionally indistinguishable from authentic student errors. Moreover, problems with a higher struggling level elicit more diverse but less student-like errors. These results highlight trade-offs in using LLMs to simulate human learners and suggest design considerations for integrating synthetic errors into teachable agents, intelligent tutoring systems, and large-scale learning analytics.
