Large language models (LLMs) have created new opportunities to assist teachers and support student learning. While researchers have explored various prompt engineering approaches in educational contexts, the degree to which these approaches generalize across domains (such as science, computing, and engineering) remains underexplored. In this paper, we introduce Chain-of-Thought Prompting + Active Learning (CoTAL), an LLM-based approach to formative assessment scoring that (1) leverages Evidence-Centered Design (ECD) to align assessments and rubrics with curriculum goals, (2) applies human-in-the-loop prompt engineering to automate response scoring, and (3) incorporates chain-of-thought (CoT) prompting and teacher and student feedback to iteratively refine questions, rubrics, and LLM prompts. Our findings demonstrate that CoTAL improves GPT-4's scoring performance across domains, achieving gains of up to 38.9% over a non-prompt-engineered baseline (i.e., without labeled examples, chain-of-thought prompting, or iterative refinement). Teachers and students judge CoTAL to be effective at scoring and explaining responses, and their feedback produces valuable insights that enhance grading accuracy and explanation quality.