The advent of large language models (LLMs) in the education sector has spurred efforts to automate the grading of short-answer questions. LLMs make evaluating short answers highly efficient, helping to address issues such as staff shortages. However, in the task of Automated Short Answer Grading (ASAG), LLM responses are influenced by the diverse perspectives in their training data, leading to inaccuracies when evaluating nuanced or partially correct answers. To address this challenge, we propose a novel framework, Grade Guard. 1. To enhance the task-based specialization of the LLMs, the temperature parameter is tuned using Root Mean Square Error (RMSE). 2. Unlike traditional approaches, LLMs in Grade Guard compute an Indecisiveness Score (IS) alongside the grade to reflect uncertainty in the predicted grade. 3. A Confidence-Aware Loss (CAL) is introduced to generate an optimized Indecisiveness Score (IS). 4. To improve reliability, self-reflection based on the optimized IS is incorporated into the framework, enabling human re-evaluation to minimize incorrect grade assignments. Our experiments show that the best setting of Grade Guard outperforms traditional methods by 19.16% in RMSE on Upstage Solar Pro, 23.64% on Upstage Solar Mini, 4.00% on Gemini 1.5 Flash, and 10.20% on GPT-4o Mini. Future work includes generating rationales for grades to improve interpretability and accuracy, and expanding benchmark datasets annotated with domain-specific nuances to further improve grading accuracy. Finally, analyzing feedback to strengthen confidence in predicted grades, reduce bias, optimize grading criteria, and personalize learning, while supporting multilingual grading, will make the solution more accurate, adaptable, fair, and inclusive.
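The RMSE-based temperature tuning described in step 1 can be sketched as a simple grid search: for each candidate temperature, grade a held-out set of answers and keep the temperature that minimizes RMSE against reference grades. This is a minimal illustration, not the paper's implementation; `grade_fn` is a hypothetical stand-in for the LLM grading call, and the candidate temperature grid is an assumption.

```python
import math


def rmse(predicted, gold):
    """Root Mean Square Error between predicted and reference grades."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(predicted, gold)) / len(gold))


def select_temperature(grade_fn, answers, gold_grades,
                       temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Pick the sampling temperature whose predicted grades minimize RMSE.

    grade_fn(answer, temperature) -> float is a hypothetical stand-in for
    the LLM grading call; `temperatures` is an assumed candidate grid.
    """
    best_t, best_err = None, float("inf")
    for t in temperatures:
        preds = [grade_fn(a, t) for a in answers]
        err = rmse(preds, gold_grades)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err
```

In practice the held-out grading runs would be cached per temperature, since each candidate requires one LLM call per answer.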
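The self-reflection step (step 4) amounts to a routing decision: predictions whose optimized Indecisiveness Score exceeds a threshold are escalated for human re-evaluation, and the rest are accepted automatically. A minimal sketch, assuming a single scalar IS in [0, 1] and a hypothetical threshold value not specified in the abstract:

```python
def route_for_review(grade, indecisiveness, threshold=0.5):
    """Self-reflection routing: escalate uncertain grades to a human.

    `indecisiveness` is the optimized Indecisiveness Score (IS) for the
    predicted grade; `threshold` is a hypothetical cutoff, not a value
    from the paper. Returns a (decision, grade) pair.
    """
    if indecisiveness > threshold:
        return ("human_review", grade)
    return ("accepted", grade)


def route_batch(predictions, threshold=0.5):
    """Apply the routing rule to (grade, IS) pairs for a batch of answers."""
    return [route_for_review(g, s, threshold) for g, s in predictions]
```

Because only high-IS predictions reach a human, the threshold trades grading cost against the abstract's goal of minimizing incorrect grade assignments.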