NERIF: GPT-4V for Automatic Scoring of Drawn Models

Scoring student-drawn models is time-consuming. The recently released GPT-4V, with its powerful image-processing capability, provides a unique opportunity to advance scientific modeling practices. To test this capability specifically for automatic scoring, we developed NERIF (Notation-Enhanced Rubric Instruction for Few-shot Learning), a method that uses instructional notes and rubrics to prompt GPT-4V to score students' drawn models of science phenomena. We randomly selected a balanced data set (N = 900) of student-drawn models from six modeling assessment tasks. GPT-4V assigned each model one of three levels, 'Beginning,' 'Developing,' or 'Proficient,' according to the scoring rubrics. GPT-4V scores were compared with human experts' scores to calculate scoring accuracy. Results show that GPT-4V's average scoring accuracy was mean = .51, SD = .037. Specifically, average scoring accuracy was .64 for the 'Beginning' class, .62 for the 'Developing' class, and .26 for the 'Proficient' class, indicating that more proficient models are more challenging to score. A further qualitative study reveals how GPT-4V retrieves information from image input, including the problem context, example evaluations provided by human coders, and the student-drawn models. We also uncovered how GPT-4V captures the characteristics of student-drawn models and narrates them in natural language. Finally, we demonstrated how GPT-4V assigns scores to student-drawn models according to the given scoring rubric and instructional notes. Our findings suggest that NERIF is an effective approach for employing GPT-4V to score drawn models. Although GPT-4V has room to improve in scoring accuracy, some mis-assigned scores seemed interpretable to experts. The results of this study show that using GPT-4V for automatic scoring of student-drawn models is promising.
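As a minimal sketch of how accuracy figures of this kind might be computed, the Python below compares model-assigned levels with human-assigned levels and reports accuracy per level plus the unweighted mean across levels. The function name `per_class_accuracy` and the toy labels are hypothetical and are not taken from the study. If "balanced" is read as equal numbers of models per level (an assumption), the unweighted mean of the three reported per-class values, (.64 + .62 + .26) / 3, is about .51, which is consistent with the reported overall accuracy.

```python
from collections import defaultdict

# The three rubric levels named in the abstract.
LEVELS = ("Beginning", "Developing", "Proficient")


def per_class_accuracy(human, model):
    """Return (accuracy per human-assigned level, unweighted mean over levels).

    `human` and `model` are equal-length sequences of level names.
    Hypothetical helper for illustration; not the authors' code.
    """
    if len(human) != len(model):
        raise ValueError("label sequences must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for h, m in zip(human, model):
        total[h] += 1
        if h == m:
            correct[h] += 1
    # Only levels that actually occur in the human labels are reported.
    per_class = {lvl: correct[lvl] / total[lvl] for lvl in LEVELS if total[lvl]}
    if not per_class:
        raise ValueError("no labels from LEVELS were found")
    macro = sum(per_class.values()) / len(per_class)
    return per_class, macro


# Toy data, invented for illustration only.
human = ["Beginning", "Beginning", "Developing",
         "Developing", "Proficient", "Proficient"]
model = ["Beginning", "Developing", "Developing",
         "Developing", "Proficient", "Beginning"]

per_class, macro = per_class_accuracy(human, model)
print(per_class)
print(round(macro, 3))

# Consistency check on the values reported in the abstract,
# assuming equal numbers of models per level.
reported_mean = (0.64 + 0.62 + 0.26) / 3
print(round(reported_mean, 2))
```

With equal class sizes, the unweighted mean over levels coincides with plain overall accuracy; with unequal class sizes the two would differ, so the choice of average would need to be stated.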
