Automated explanatory feedback systems play a crucial role in facilitating learning for large cohorts of learners by offering feedback that incorporates explanations, which significantly enhances the learning process. However, delivering such explanatory feedback in real time is challenging, particularly when high classification accuracy is required for domain-specific, nuanced responses. Our study leverages large language models, specifically Generative Pre-Trained Transformers (GPT), to explore a sequence labeling approach that identifies components of desired and less desired praise in a tutor training dataset, as a basis for providing explanatory feedback. Our aim is to equip tutors with actionable, explanatory feedback during online training lessons. To investigate the potential of GPT models for providing explanatory feedback, we employed two commonly used approaches: prompting and fine-tuning. To quantify the quality of the praise components highlighted by GPT models, we introduced a Modified Intersection over Union (M-IoU) score. Our findings demonstrate that: (1) the M-IoU score correlates with human judgment in evaluating sequence quality; (2) two-shot prompting on GPT-3.5 yielded reasonable performance in recognizing effort-based praise (M-IoU of 0.46) and outcome-based praise (M-IoU of 0.68); and (3) our optimally fine-tuned GPT-3.5 model achieved M-IoU scores of 0.64 for effort-based praise and 0.84 for outcome-based praise, consistent with the satisfaction levels evaluated by human coders. Our results show promise for using GPT models to give tutors feedback that highlights the specific elements of their open-ended responses that are desirable or could be improved.
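To make the idea of scoring highlighted spans concrete, the sketch below computes a plain token-level Intersection over Union between a model's highlighted tokens and a human annotator's. It is only an illustration of the standard IoU that the M-IoU score builds on: the abstract does not specify the modification, so none is attempted here. The function name `token_iou`, the example sentence, the token indices, and the handling of two empty highlights are all assumptions made for this example.

```python
from typing import Set


def token_iou(predicted: Set[int], reference: Set[int]) -> float:
    """Plain IoU between two sets of highlighted token indices.

    This is the standard (unmodified) IoU, not the paper's M-IoU.
    """
    union = predicted | reference
    if not union:
        # Assumption: if neither side highlights anything, treat it as full agreement.
        return 1.0
    return len(predicted & reference) / len(union)


# Hypothetical tutor response, split on whitespace into tokens 0..8.
tokens = "Great job you worked really hard on this problem".split()

# Hypothetical annotations for effort-based praise.
reference = {2, 3, 4, 5}  # human coder highlights "you worked really hard"
predicted = {3, 4, 5, 6}  # model highlights "worked really hard on"

score = token_iou(predicted, reference)
print(f"token-level IoU = {score:.2f}")
```

Here the two highlights share three tokens out of five distinct highlighted tokens, so the overlap is partial rather than all-or-nothing, which is the property that makes an IoU-style measure suitable for grading sequence labels.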