Since the disruption in LLM technology brought about by the release of GPT-3 and ChatGPT, LLMs have shown remarkable promise in programming-related tasks. While code generation remains a popular field of research, code evaluation using LLMs remains a problem with no conclusive solution. In this paper, we focus on LLM-based code evaluation and attempt to fill the existing gaps. We propose novel multi-agentic approaches that use question-specific rubrics tailored to the problem statement, arguing that these perform better for logical assessment than existing approaches that rely on question-agnostic rubrics. To address the lack of suitable evaluation datasets, we introduce two datasets: a Data Structures and Algorithms dataset containing 150 student submissions from a popular Data Structures and Algorithms practice website, and an Object-Oriented Programming dataset comprising 80 student submissions from undergraduate computer science courses. In addition to standard metrics (Spearman correlation, Cohen's kappa), we propose a new metric called Leniency, which quantifies evaluation strictness relative to expert assessment. Our comprehensive analysis demonstrates that question-specific rubrics significantly enhance logical assessment of code in educational settings, providing feedback that is better aligned with instructional goals beyond mere syntactic correctness.