Active learning is an iterative labeling process used to obtain a small labeled subset in the absence of labeled data, thereby enabling the training of models for supervised tasks such as text classification. Although active learning has made considerable progress in recent years thanks to pre-trained language models, there is untapped potential in the often neglected unlabeled portion of the data, which is typically available in much larger quantities than the small set of labeled data. In this work, we investigate how self-training, a semi-supervised approach that uses a model to obtain pseudo-labels for unlabeled data, can improve the efficiency of active learning for text classification. Building on a comprehensive reproduction of four previous self-training approaches, some of which are evaluated for the first time in the context of active learning or natural language processing, we introduce HAST, a new and effective self-training strategy, which we evaluate on four text classification benchmarks. Our results show that HAST outperforms the reproduced self-training approaches and, on three of the four datasets, achieves classification results comparable to previous experiments while using as little as 25% of the data. The code is publicly available at https://github.com/chschroeder/self-training-for-sample-efficient-active-learning .
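To make the self-training idea mentioned above concrete, the following is a minimal, generic sketch of confidence-thresholded self-training, not the HAST strategy itself: a model trained on a small labeled set assigns pseudo-labels to those unlabeled points it predicts with high confidence, and these are added to the training set for retraining. The classifier, threshold, and number of rounds are illustrative choices, not details from the paper.

```python
# Generic self-training sketch (illustrative; not the paper's HAST method).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unl, threshold=0.9, rounds=3):
    """Iteratively pseudo-label confident unlabeled points and retrain."""
    model = LogisticRegression(max_iter=1000)
    X, y = X_lab, y_lab
    for _ in range(rounds):
        model.fit(X, y)
        if len(X_unl) == 0:
            break
        proba = model.predict_proba(X_unl)
        confident = proba.max(axis=1) >= threshold  # high-confidence mask
        if not confident.any():
            break
        # Pseudo-label: map the argmax column back to the class label.
        pseudo = model.classes_[proba.argmax(axis=1)[confident]]
        X = np.vstack([X, X_unl[confident]])
        y = np.concatenate([y, pseudo])
        X_unl = X_unl[~confident]  # remove points that were pseudo-labeled
    return model

# Toy usage: 25 labeled points, the rest treated as unlabeled.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = self_train(X[:25], y[:25], X[25:])
```

In an active-learning loop, a step like this would run between labeling rounds, letting the unlabeled pool contribute training signal beyond the few queried labels.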