This paper assesses the potential of the large language models (LLMs) GPT-4 and GPT-3.5 to aid in deriving insight from education feedback surveys. Exploration of LLM use cases in education has focused on teaching and learning, with less attention paid to their capabilities in education feedback analysis. Survey analysis in education involves goals such as finding gaps in curricula or evaluating teachers, and often requires time-consuming manual processing of textual responses. LLMs have the potential to provide a flexible means of achieving these goals without specialized machine learning models or fine-tuning. We demonstrate a versatile approach to such goals by treating them as sequences of natural language processing (NLP) tasks, including classification (multi-label, multi-class, and binary), extraction, thematic analysis, and sentiment analysis, each performed by an LLM. We apply these workflows to a real-world dataset of 2500 end-of-course survey comments from biomedical science courses, and evaluate a zero-shot approach (i.e., one requiring no examples or labeled training data) across all tasks, reflecting education settings, where labeled data is often scarce. By applying effective prompting practices, we achieve human-level performance on multiple tasks with GPT-4, enabling the workflows needed to achieve typical survey analysis goals. We also show the potential of inspecting LLMs' chain-of-thought (CoT) reasoning to provide insight that may foster confidence in practice. Moreover, this study features the development of a versatile set of classification categories, suitable for various course types (online, hybrid, or in-person) and amenable to customization. Our results suggest that LLMs can be used to derive a range of insights from survey text.
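To make the zero-shot, multi-label classification step concrete, the following is a minimal sketch of how one such task might be framed. It is not the prompt or code used in this study. The category names, the JSON reply format, and the helper names (`build_prompt`, `parse_labels`, `classify`) are all assumptions made for illustration, and the model call is abstracted as a caller-supplied function `llm` that maps a prompt string to a reply string, so no particular vendor API is implied.

```python
import json
from typing import Callable, List

# Hypothetical category names for illustration only; not the paper's category set.
CATEGORIES = ["course content", "teaching quality", "assessment", "logistics"]


def build_prompt(comment: str, categories: List[str]) -> str:
    """Build a zero-shot multi-label prompt: instructions only, no labeled examples."""
    category_list = "\n".join(f"- {c}" for c in categories)
    return (
        "You are analyzing end-of-course survey comments.\n"
        "Assign every category that applies to the comment below.\n"
        f"Categories:\n{category_list}\n\n"
        f"Comment: {comment}\n\n"
        # Asking for brief reasoning alongside the labels lets a reviewer inspect it later.
        'Reply with JSON only: {"reasoning": "<brief reasoning>", '
        '"labels": ["<category>", ...]}'
    )


def parse_labels(raw: str, categories: List[str]) -> List[str]:
    """Parse the model reply and keep only labels from the allowed category set."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []  # Unparseable reply: treat as no labels rather than guessing.
    if not isinstance(data, dict):
        return []
    labels = data.get("labels", [])
    if not isinstance(labels, list):
        return []
    allowed = set(categories)
    return [label for label in labels if isinstance(label, str) and label in allowed]


def classify(
    comment: str,
    llm: Callable[[str], str],
    categories: List[str] = CATEGORIES,
) -> List[str]:
    """Run one zero-shot classification; `llm` is any prompt-to-text function."""
    return parse_labels(llm(build_prompt(comment, categories)), categories)
```

Because `llm` is passed in, the same function can be exercised with a stub for testing or wrapped around whichever model client is in use. Filtering the reply against the allowed set is one simple guard against labels the model invents.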