To provide teachers with more specific, frequent, and actionable feedback about their teaching, we explore how Large Language Models (LLMs) can be used to estimate ``Instructional Support'' domain scores of the CLassroom Assessment Scoring System (CLASS), a widely used observation protocol. We design a machine learning architecture that uses zero-shot prompting of Meta's Llama2, a classic Bag of Words (BoW) model, or both, to classify individual utterances of teachers' speech (transcribed automatically using OpenAI's Whisper) for the presence of Instructional Support. These utterance-level judgments are then aggregated over an entire 15-minute observation session to estimate a global CLASS score. Experiments on two CLASS-coded datasets of toddler and pre-kindergarten classrooms indicate that (1) the accuracy of automatic CLASS Instructional Support estimation with the proposed method (Pearson $R$ up to $0.47$) approaches human inter-rater reliability (up to $R=0.55$); (2) LLMs yield slightly greater accuracy than BoW for this task, though the best models often combine features extracted from both LLM and BoW; and (3) for classifying individual utterances, automated methods still fall short of human-level judgments. Finally, (4) we illustrate how the model's outputs can be visualized at the utterance level to provide teachers with explainable feedback on which utterances were most positively or negatively correlated with specific CLASS dimensions.
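The aggregation and evaluation steps described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how utterance-level judgments are combined, so the rule used here (the fraction of utterances flagged in a session) is an assumption chosen for simplicity, and the function names `session_feature` and `pearson_r` are hypothetical. The session data and human scores are made-up toy values, not results from the paper.

```python
from math import sqrt


def session_feature(utterance_flags):
    """Aggregate binary utterance-level judgments for one session.

    ASSUMPTION: the aggregate is the fraction of utterances flagged as
    showing Instructional Support. The abstract does not state the
    aggregation rule actually used.
    """
    if not utterance_flags:
        return 0.0
    return sum(utterance_flags) / len(utterance_flags)


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy data (fabricated for illustration only): each inner list holds the
# per-utterance judgments (1 = Instructional Support present) for one
# observation session.
sessions = [
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1],
]
# Toy human-coded session scores, also fabricated for illustration.
human_scores = [2.0, 5.0, 3.0, 6.5]

features = [session_feature(s) for s in sessions]
r = pearson_r(features, human_scores)
print(f"Pearson R between aggregated feature and human scores: {r:.2f}")
```

In a fuller pipeline, the per-session features would more plausibly feed a regression model that predicts the CLASS score, and Pearson $R$ would be computed between those predictions and the human-coded scores on held-out sessions.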