Benchmarking Educational LLMs with Analytics: A Case Study on Gender Bias in Feedback

As teachers increasingly turn to GenAI in their educational practice, we need robust methods to benchmark large language models (LLMs) for pedagogical purposes. This article presents an embedding-based benchmarking framework for detecting bias in LLM-generated formative feedback. Using 600 authentic student essays from the AES 2.0 corpus, we constructed controlled counterfactuals along two dimensions: (i) implicit cues, via lexicon-based swaps of gendered terms within the essays, and (ii) explicit cues, via a gendered author background stated in the prompt. We evaluated six representative LLMs (GPT-5 mini, GPT-4o mini, DeepSeek-R1, DeepSeek-R1-Qwen, Gemini 2.5 Pro, and Llama-3-8B). We first quantified response divergence with cosine and Euclidean distances over sentence embeddings, then assessed significance via permutation tests, and finally visualised the structure of the responses using dimensionality reduction. In all models, implicit manipulations reliably induced larger semantic shifts for male-to-female counterfactuals than for female-to-male counterfactuals. Only the GPT and Llama models showed sensitivity to explicit gender cues. These findings show that even state-of-the-art LLMs exhibit asymmetric semantic responses to gender substitutions, suggesting persistent gender bias in the feedback they provide to learners. Qualitative analyses further revealed consistent linguistic differences (e.g., more autonomy-supportive feedback under male cues vs. more controlling feedback under female cues). We discuss implications for fairness auditing of pedagogical GenAI, propose reporting standards for counterfactual evaluation in learning analytics, and outline practical guidance for prompt design and deployment to safeguard equitable feedback.
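
The divergence and significance steps described above could be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the choice of a two-sample label-shuffling permutation test on the difference in mean distance, and the random arrays standing in for sentence embeddings are all assumptions made for the example. In practice the arrays would hold embeddings of the feedback generated for each original essay and its counterfactual.

```python
import numpy as np


def cosine_distance(a, b):
    """Row-wise cosine distance (1 - cosine similarity) between two (n, d) arrays."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - np.sum(a_unit * b_unit, axis=1)


def euclidean_distance(a, b):
    """Row-wise Euclidean distance between two (n, d) arrays."""
    return np.linalg.norm(a - b, axis=1)


def permutation_test(x, y, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in means of x and y.

    Group labels are shuffled n_perm times; the p-value is the share of
    shuffles whose absolute mean difference is at least the observed one
    (with the usual +1 correction so the p-value is never exactly zero).
    """
    rng = np.random.default_rng(seed)
    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    n_x = len(x)
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)
        diff = shuffled[:n_x].mean() - shuffled[n_x:].mean()
        if abs(diff) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Synthetic placeholders, NOT data from the study: random vectors stand in
# for sentence embeddings of feedback on original and counterfactual essays.
rng = np.random.default_rng(42)
n_essays, dim = 50, 16
orig_m = rng.normal(size=(n_essays, dim))                    # feedback on male-cued originals
cf_m2f = orig_m + rng.normal(scale=0.5, size=(n_essays, dim))  # male-to-female counterfactuals
orig_f = rng.normal(size=(n_essays, dim))                    # feedback on female-cued originals
cf_f2m = orig_f + rng.normal(scale=0.3, size=(n_essays, dim))  # female-to-male counterfactuals

# Per-essay semantic shift in each swap direction.
shift_m2f = cosine_distance(orig_m, cf_m2f)
shift_f2m = cosine_distance(orig_f, cf_f2m)

# Is the mean shift different between the two directions?
diff, p_value = permutation_test(shift_m2f, shift_f2m)
print(f"mean cosine shift difference: {diff:.4f}, permutation p-value: {p_value:.4f}")
```

The same `permutation_test` call can be repeated with `euclidean_distance` in place of `cosine_distance` to check whether a conclusion depends on the distance measure.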
