As the world becomes increasingly saturated with AI-generated content, disinformation, and algorithmic persuasion, critical thinking, the capacity to evaluate evidence, detect unreliable claims, and exercise independent judgment, is becoming a defining human skill. Developing critical thinking through timely assessment and feedback is crucial; however, there has been little work in educational data mining on defining, measuring, and supporting it. In this paper, we investigate the feasibility of measuring the "subskills" that underlie critical thinking. We ground our work in an authentic task in which students operationalize critical thinking by writing argumentative essays. We developed a coding rubric based on an established skills progression and completed human coding for a corpus of student essays. We then evaluated three approaches to automated scoring: zero-shot prompting, few-shot prompting, and supervised fine-tuning, implemented across three large language models (GPT-5, Llama 3.1 8B, and ModernBERT). Fine-tuning Llama 3.1 8B achieved the best results, performing especially well on subskills with highly separable proficiency levels and balanced label distributions, while performance was lower on subskills that required detecting subtle distinctions between proficiency levels or that had imbalanced labels. Our exploratory work represents an initial step toward scalable assessment of critical thinking across authentic educational contexts. Future research should continue to combine automated critical thinking assessment with human validation to more accurately detect and measure dynamic, higher-order thinking skills.