The use of Large Language Models (LLMs) as automatic judges for code evaluation is becoming increasingly prevalent in academic environments, but their reliability can be compromised by students who employ adversarial prompting strategies to induce misgrading and secure undeserved academic advantages. In this paper, we present the first large-scale study of jailbreaking LLM-based automated code evaluators in an academic context. Our contributions are: (i) we systematically adapt 20+ jailbreaking strategies to attack AI code evaluators in the academic setting, defining a new class of attacks we term academic jailbreaking; (ii) we release a poisoned dataset of 25K adversarial student submissions, specifically designed for the academic code-evaluation setting, sourced from diverse real-world coursework and paired with rubrics and human-graded references; (iii) to capture the multidimensional impact of academic jailbreaking, we systematically adapt and define three jailbreaking metrics (Jailbreak Success Rate, Score Inflation, and Harmfulness); and (iv) we comprehensively evaluate academic jailbreaking attacks on six LLMs. We find that these models exhibit significant vulnerability, particularly to persuasive and role-play-based attacks (up to 97% JSR). Our adversarial dataset and benchmark suite lay the groundwork for next-generation robust LLM-based evaluators in academic code assessment.
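To make the first two metrics concrete, the following is a minimal sketch of how Jailbreak Success Rate and Score Inflation could plausibly be computed from paired human and LLM scores; the abstract does not give formulas, so the definitions, field names, and passing-threshold criterion here are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch (assumed definitions, not the paper's code): one
# plausible way to compute JSR and Score Inflation from paired scores.
from dataclasses import dataclass
from typing import List


@dataclass
class GradedSubmission:
    human_score: float        # human-graded reference score
    llm_score: float          # score assigned by the LLM evaluator
    passing_threshold: float  # rubric score needed to pass


def jailbreak_success_rate(subs: List[GradedSubmission]) -> float:
    """Fraction of adversarial submissions the LLM grades at or above
    the passing threshold although the human reference falls below it."""
    flipped = sum(
        1 for s in subs
        if s.human_score < s.passing_threshold <= s.llm_score
    )
    return flipped / len(subs)


def score_inflation(subs: List[GradedSubmission]) -> float:
    """Mean amount by which the LLM score exceeds the human reference
    (clamped at zero, so deflation does not cancel inflation)."""
    return sum(max(0.0, s.llm_score - s.human_score) for s in subs) / len(subs)
```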