As supervised fine-tuning of pre-trained models becomes more popular in NLP applications, larger corpora of annotated data are required, especially as parameter counts in large language models grow. Active learning, which attempts to mine and annotate the unlabeled instances that improve model performance as quickly as possible, is a common choice for reducing annotation cost; however, most methods ignore class imbalance and either assume access to initial annotated data or require multiple rounds of active learning selection before performance on rare classes improves. We present STENCIL, which uses a set of text exemplars and the recently proposed submodular mutual information to select a set of weakly labeled rare-class instances that are then strongly labeled by an annotator. We show that, in the class-imbalanced cold-start setting, STENCIL improves overall accuracy by $10\%-18\%$ and rare-class F1 score by $17\%-40\%$ over common active learning methods on multiple text classification datasets.
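To make the selection step concrete, here is a minimal sketch of exemplar-guided selection with a submodular mutual information objective. The abstract does not say which instantiation STENCIL uses or how text is embedded, so everything here is an assumption: the sketch uses a facility-location-style mutual information, $I(A;Q) = \sum_{q \in Q} \max_{a \in A} s(a,q) + \eta \sum_{a \in A} \max_{q \in Q} s(a,q)$, with cosine similarity over precomputed embeddings, maximized greedily. The names `flqmi_greedy`, `pool_emb`, `query_emb`, and `eta` are hypothetical and chosen for illustration only.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def flqmi_greedy(pool_emb, query_emb, budget, eta=1.0):
    """Greedily pick `budget` pool items that are most informative about the exemplars.

    pool_emb:  (n_pool, d) embeddings of unlabeled instances (assumed precomputed).
    query_emb: (n_query, d) embeddings of the rare-class text exemplars.
    Returns the indices of the selected pool items, in selection order.
    """
    # Negative similarities are clipped to 0 so the objective stays monotone
    # (an assumption of this sketch, not something stated in the abstract).
    sim = np.clip(cosine_similarity(pool_emb, query_emb), 0.0, None)
    n_pool, n_query = sim.shape

    best_cover = np.zeros(n_query)        # max similarity of each exemplar to the selected set
    relevance = eta * sim.max(axis=1)     # modular term: each item's best match among exemplars
    remaining = np.ones(n_pool, dtype=bool)
    selected = []

    for _ in range(min(budget, n_pool)):
        # Marginal gain = improvement in exemplar coverage + the item's own relevance.
        coverage_gain = np.maximum(sim, best_cover).sum(axis=1) - best_cover.sum()
        gains = coverage_gain + relevance
        gains[~remaining] = -np.inf       # never re-select an item
        i = int(np.argmax(gains))
        selected.append(i)
        remaining[i] = False
        best_cover = np.maximum(best_cover, sim[i])

    return selected
```

In a pipeline of the kind the abstract describes, the returned indices would be the weakly labeled rare-class candidates that are handed to an annotator for strong labels. The embeddings could come from any sentence encoder; that choice is left open here.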