As large language models (LLMs) are increasingly deployed as automated graders in educational settings, concerns about fairness and bias in their evaluations have become critical. This study investigates whether LLMs exhibit implicit grading bias based on writing style when the underlying content correctness remains constant. We constructed a controlled dataset of 180 student responses across three subjects (Mathematics, Programming, and Essay/Writing), each with three surface-level perturbation types: grammar errors, informal language, and non-native phrasing. Two state-of-the-art open-source LLMs, LLaMA 3.3 70B (Meta) and Qwen 2.5 72B (Alibaba), were prompted to grade responses on a 1-10 scale with explicit instructions to evaluate content correctness only and to disregard writing style. Our results reveal statistically significant grading bias in Essay/Writing tasks across both models and all perturbation types (p < 0.05), with effect sizes ranging from medium (Cohen's d = 0.64) to very large (d = 4.25). Informal language received the heaviest penalty: LLaMA deducted an average of 1.90 points and Qwen 1.20 points on the 10-point scale, penalties comparable to the difference between a B+ and a C+ letter grade. Non-native phrasing was penalized by 1.35 points (LLaMA) and 0.90 points (Qwen). In sharp contrast, Mathematics and Programming tasks showed minimal bias, with most conditions failing to reach statistical significance. These findings demonstrate that LLM grading bias is subject-dependent and style-sensitive, and that it persists despite explicit counter-bias instructions in the grading prompt. We discuss implications for the equitable deployment of LLM-based grading systems and recommend bias auditing protocols before institutional adoption.
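The abstract reports two quantities per condition: a mean point penalty and a Cohen's d effect size. As a minimal sketch of how such figures can be computed, the Python below compares scores for original and perturbed versions of the same responses. It rests on assumptions: the abstract does not say which variant of Cohen's d or which significance test was used, so the pooled-standard-deviation form is shown here, and the function names and scores are hypothetical illustrations, not the study's code or data.

```python
from math import sqrt
from statistics import mean, stdev


def mean_penalty(original, perturbed):
    """Average score drop from original to perturbed responses."""
    return mean(original) - mean(perturbed)


def cohens_d(original, perturbed):
    """Cohen's d using a pooled sample standard deviation.

    Assumption: the pooled-SD (independent-groups) form. The abstract
    does not specify which variant the authors used.
    """
    n1, n2 = len(original), len(perturbed)
    s1, s2 = stdev(original), stdev(perturbed)
    pooled = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean(original) - mean(perturbed)) / pooled


# Hypothetical 1-10 scores for illustration only; not data from the study.
original_scores = [9, 8, 9, 8, 9, 8, 9, 8, 9, 8]
perturbed_scores = [7, 7, 8, 6, 7, 7, 8, 6, 7, 7]

penalty = mean_penalty(original_scores, perturbed_scores)
d = cohens_d(original_scores, perturbed_scores)
print(f"mean penalty: {penalty:.2f} points, Cohen's d: {d:.2f}")
```

A bias audit of the kind the abstract recommends could run this comparison once per model, subject, and perturbation type, then flag any condition where the penalty is both statistically significant and large in effect size.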