SIGHT: A Large Annotated Dataset on Student Insights Gathered from Higher Education Transcripts

From arXiv. The first two authors contributed equally. Published in the Proceedings of Innovative Use of NLP for Building Educational Applications 2023. The code and data are open-sourced at https://github.com/rosewang2008/sight

Lectures are a learning experience for both students and teachers. Students learn from teachers about the subject material, while teachers learn from students about how to refine their instruction. However, online student feedback is unstructured and abundant, making it challenging for teachers to learn and improve. We take a step towards tackling this challenge. First, we contribute a dataset for studying this problem: SIGHT is a large dataset of 288 math lecture transcripts and 15,784 comments collected from the Massachusetts Institute of Technology OpenCourseWare (MIT OCW) YouTube channel. Second, we develop a rubric for categorizing feedback types using qualitative analysis. Qualitative analysis methods are powerful in uncovering domain-specific insights; however, they are costly to apply to large data sources. To overcome this challenge, we propose a set of best practices for using large language models (LLMs) to cheaply classify the comments at scale. We observe a striking correlation between the model's and humans' annotations: categories with consistent human annotations (>$0.9$ inter-rater reliability, IRR) also display higher human-model agreement (>$0.7$), while categories with less consistent human annotations ($0.7$-$0.8$ IRR) correspondingly demonstrate lower human-model agreement ($0.3$-$0.5$). These techniques uncover useful student feedback from thousands of comments, costing around $\$0.002$ per comment. We conclude by discussing exciting future directions on using online student feedback and improving automated annotation techniques for qualitative research.
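To make the agreement figures above concrete, here is a minimal sketch of how a chance-corrected agreement score between two annotators (human vs. human, or human vs. model) could be computed for one feedback category. The abstract does not name the agreement statistic, so the use of Cohen's kappa here is an assumption, and the function name `cohen_kappa` and the toy labels are hypothetical, not taken from the SIGHT code or data. The last lines multiply the reported per-comment cost by the dataset size to get a rough total.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences.

    Assumption: kappa stands in for the agreement metric; the abstract
    does not say which statistic the authors used.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical binary labels for one category (1 = comment belongs to it).
human = [1, 0, 1, 1, 0, 0, 1, 0]
model = [1, 0, 1, 0, 0, 0, 1, 1]
kappa = cohen_kappa(human, model)

# Rough total cost of labeling every comment at the reported per-comment rate.
num_comments = 15_784
cost_per_comment = 0.002  # USD, as reported in the abstract
total_cost = num_comments * cost_per_comment

print(f"kappa = {kappa:.2f}, estimated total cost = ${total_cost:.2f}")
```

On the toy labels the two annotators agree on 6 of 8 comments, and chance agreement is 0.5, giving a kappa of 0.5. At roughly $0.002 per comment, labeling all 15,784 comments would cost on the order of $32.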
