Large language models (LLMs) have demonstrated remarkable potential for automatic short answer grading (ASAG), significantly improving the efficiency and scalability of student assessment in educational settings. However, their vulnerability to adversarial manipulation raises critical concerns about the fairness and reliability of automated grading. In this paper, we introduce GradingAttack, a fine-grained adversarial attack framework that systematically evaluates the vulnerability of LLM-based ASAG models. Specifically, we align general-purpose attack methods with the objectives of ASAG by designing token-level and prompt-level strategies that manipulate grading outcomes while remaining highly camouflaged. Furthermore, we propose a novel evaluation metric that jointly quantifies attack success and camouflage. Experiments on multiple datasets demonstrate that both attack strategies effectively mislead grading models, with prompt-level attacks achieving higher success rates and token-level attacks exhibiting superior camouflage. Our findings underscore the need for robust defenses to ensure fairness and reliability in ASAG. Our code and datasets are available at https://anonymous.4open.science/r/GradingAttack.