Does Informativeness Matter? Active Learning for Educational Dialogue Act Classification

Dialogue Acts (DAs) can be used to explain what expert tutors do and what students know during the tutoring process. Most empirical studies use random sampling to obtain sentence samples for manual DA annotation, and the annotated samples are then used to train DA classifiers. However, these studies have paid little attention to sample informativeness, which reflects how much information the selected samples carry and indicates the extent to which a classifier can learn patterns from them. Notably, informativeness may vary among samples, and a classifier might need only a small number of low-informativeness samples to learn their patterns. Because random sampling overlooks sample informativeness, it may spend human labelling effort on samples that contribute little to training the classifiers. As an alternative, researchers suggest employing the statistical sampling methods of Active Learning (AL) to identify informative samples for training the classifiers. However, the use of AL methods in educational DA classification tasks is under-explored. In this paper, we first examine the informativeness of annotated sentence samples. We then investigate how AL methods can select informative samples to support DA classifiers in the AL sampling process. The results reveal that most annotated sentences in the training dataset present low informativeness and that the patterns of these sentences can be easily captured by the DA classifier. We also demonstrate how AL methods can reduce the cost of manual annotation in the AL sampling process.
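To make the idea of selecting informative samples concrete, here is a minimal sketch of one common AL strategy, least-confidence uncertainty sampling. This is an illustrative assumption and not necessarily one of the AL methods evaluated in the paper; the function names and the probability values are hypothetical.

```python
from typing import List, Sequence


def least_confidence(probs: Sequence[float]) -> float:
    """Informativeness proxy: 1 minus the top predicted class probability."""
    return 1.0 - max(probs)


def select_for_annotation(pool_probs: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return indices of the k pool samples the classifier is least sure about."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: least_confidence(pool_probs[i]),
        reverse=True,  # highest uncertainty first
    )
    return ranked[:k]


# Hypothetical predicted DA-class probabilities for four unlabelled sentences.
pool_probs = [
    [0.98, 0.01, 0.01],  # confidently classified: low informativeness
    [0.40, 0.35, 0.25],  # uncertain: high informativeness
    [0.90, 0.05, 0.05],
    [0.50, 0.45, 0.05],
]

# Indices of the two sentences to send to human annotators next.
chosen = select_for_annotation(pool_probs, k=2)
print(chosen)
```

In an AL loop of this kind, the chosen sentences would be annotated, added to the training set, and the classifier retrained before the next selection round, whereas random sampling would pick sentences without regard to these scores.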

