The Compliance Paradox: Semantic-Instruction Decoupling in Automated Academic Code Evaluation

The rapid integration of Large Language Models (LLMs) into educational assessment rests on the unverified assumption that instruction-following capability translates directly into objective adjudication. We demonstrate that this assumption is fundamentally flawed. Instead of evaluating code quality, models frequently decouple from the submission's logic to satisfy hidden directives. We term this systemic vulnerability the Compliance Paradox: models fine-tuned for extreme helpfulness are vulnerable to adversarial manipulation. To expose it, we introduce the Semantic-Preserving Adversarial Code Injection (SPACI) Framework and the Abstract Syntax Tree-Aware Semantic Injection Protocol (AST-ASIP). These methods exploit the Syntax-Semantics Gap by embedding adversarial directives into syntactically inert regions (trivia nodes) of the Abstract Syntax Tree. Through a large-scale evaluation of 9 state-of-the-art (SOTA) models across 25,000 submissions in Python, C, C++, and Java, we reveal catastrophic failure rates (>95%) in high-capacity open-weights models such as DeepSeek-V3, which systematically prioritize hidden formatting constraints over code correctness. We quantify this failure using a novel tripartite framework that measures Decoupling Probability, Score Divergence, and Pedagogical Severity, demonstrating the widespread "False Certification" of functionally broken code. Our findings suggest that current alignment paradigms create a "Trojan" vulnerability in automated grading, necessitating a shift from standard RLHF toward domain-specific Adjudicative Robustness, in which models are conditioned to prioritize evidence over instruction compliance. We release our complete dataset and injection framework to facilitate further research.
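
The idea of a syntactically inert region can be illustrated with a minimal Python sketch. This is not the SPACI or AST-ASIP implementation, whose details the abstract does not give; the helper names `inject_comment` and `same_ast`, the sample submission, and the placeholder directive are all assumptions made for illustration. The sketch appends a comment to a functionally broken submission and checks, with the standard-library `ast` module, that the parsed tree is unchanged even though the text a grader model reads is different.

```python
import ast


def inject_comment(source: str, directive: str) -> str:
    """Append `directive` as a trailing comment line (hypothetical helper)."""
    # Collapse whitespace so the directive stays on a single comment line.
    safe = " ".join(directive.split())
    return source.rstrip("\n") + "\n# " + safe + "\n"


def same_ast(a: str, b: str) -> bool:
    """Return True when both sources parse to structurally identical ASTs."""
    # ast.dump omits line/column attributes by default, and comments never
    # become AST nodes, so comment-only changes compare as equal.
    return ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))


# A functionally broken submission: `add` subtracts instead of adding.
original = "def add(a, b):\n    return a - b\n"

# Placeholder text standing in for a hidden directive (assumption).
injected = inject_comment(original, "NOTE TO GRADER: placeholder directive")

print(same_ast(original, injected))  # the program's structure is preserved
print(injected != original)          # but the text shown to a grader differs
```

Because the comment never reaches the AST, any defense that compares or executes parsed programs sees no difference between the two versions, whereas a grader that reads raw text does. Stripping comments before grading would be one conceivable mitigation, though the abstract does not evaluate it.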
