Grading programming assignments is a labor-intensive and time-consuming process that demands careful evaluation across multiple dimensions of the code. To reduce this burden, automated grading systems have been adopted to improve efficiency and lighten educators' workload. However, traditional automated grading systems often focus solely on correctness, failing to provide interpretable evaluations or actionable feedback for students. This study introduces StepGrade, which explores Chain-of-Thought (CoT) prompting with Large Language Models (LLMs) as a solution to these challenges. Unlike regular prompting, which yields limited, surface-level outputs, CoT prompting allows the model to reason step-by-step through interconnected grading criteria, i.e., functionality, code quality, and algorithmic efficiency, ensuring a more comprehensive and transparent evaluation. This interconnectedness necessitates CoT prompting to address each criterion systematically while accounting for their mutual influence. To empirically validate the effectiveness of StepGrade, we conducted a case study involving 30 Python programming assignments across three difficulty levels (easy, intermediate, and advanced). The approach is validated against expert human evaluations to assess its consistency, accuracy, and fairness. Results demonstrate that CoT prompting significantly outperforms regular prompting in both grading quality and interpretability. By reducing the time and effort required for manual grading, this research demonstrates the potential of GPT-4 with CoT prompting to revolutionize programming education through scalable and pedagogically effective automated grading systems.
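To make the grading workflow above concrete, the sketch below shows how a CoT grading prompt over the three interconnected criteria might be assembled. The abstract does not give StepGrade's actual prompt wording, so the rubric phrasing, function name, and step structure here are illustrative assumptions, not the system's real implementation.

```python
# Hypothetical sketch of a Chain-of-Thought grading prompt.
# The criteria come from the abstract; all wording below is assumed.

CRITERIA = ["functionality", "code quality", "algorithmic efficiency"]

def build_cot_grading_prompt(assignment: str, submission: str) -> str:
    """Build a prompt that asks the model to reason step-by-step through
    each grading criterion, considering their mutual influence, before
    producing a final grade and actionable feedback."""
    steps = "\n".join(
        f"Step {i}: Evaluate {criterion}, explain how it interacts with the "
        f"criteria assessed so far, then assign a sub-score out of 10."
        for i, criterion in enumerate(CRITERIA, start=1)
    )
    return (
        "You are grading a Python programming assignment.\n\n"
        f"Assignment description:\n{assignment}\n\n"
        f"Student submission:\n{submission}\n\n"
        "Reason step by step before giving any score:\n"
        f"{steps}\n"
        f"Step {len(CRITERIA) + 1}: Combine the sub-scores into a final "
        "grade and give the student actionable feedback."
    )

prompt = build_cot_grading_prompt(
    "Write a function that returns the n-th Fibonacci number.",
    "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)",
)
```

The resulting string would then be sent to the LLM (e.g., GPT-4) in place of a regular single-shot "assign a grade" prompt, which is the comparison the study evaluates.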