Although transformer-based systems achieve higher accuracy with fewer training examples, data acquisition remains an obstacle for rare-class tasks, in which the class label is very infrequent (e.g., fewer than 5% of samples). Active learning has been proposed to alleviate such challenges, but the choice of selection strategy, the criteria by which rare-class examples are chosen, has not been systematically evaluated. Further, transformers enable iterative transfer-learning approaches. We propose and investigate transfer- and active-learning solutions to the rare-class problem of dissonance detection, using models trained on closely related tasks and evaluating acquisition strategies, including a proposed probability-of-rare-class (PRC) approach. We perform these experiments on a specific rare-class problem: collecting language samples of cognitive dissonance from social media. We find that PRC is a simple and effective strategy for guiding annotation and ultimately improving model accuracy, whereas transfer learning in a specific order can improve the learner's cold-start performance but does not benefit the iterations of active learning.
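
The abstract does not spell out the PRC selection rule. As a rough illustration only, and assuming PRC means ranking the unlabeled pool by the current model's predicted probability of the rare class and sending the top-scoring samples for annotation, one acquisition step might be sketched as follows. The names `select_prc`, `rare_class_probs`, and `k` are hypothetical and do not come from the paper.

```python
import numpy as np


def select_prc(rare_class_probs, k):
    """Pick the k pool items with the highest predicted rare-class probability.

    Assumption: this is one plausible reading of "probability-of-rare-class"
    acquisition, not the authors' confirmed implementation.
    """
    probs = np.asarray(rare_class_probs, dtype=float)
    if probs.ndim != 1:
        raise ValueError("rare_class_probs must be one-dimensional")
    if not 0 < k <= probs.size:
        raise ValueError("k must be between 1 and the pool size")
    # Sort descending; the stable sort keeps pool order among tied scores.
    order = np.argsort(-probs, kind="stable")
    return order[:k].tolist()


# Hypothetical model scores for five unlabeled samples.
pool_scores = [0.02, 0.40, 0.10, 0.75, 0.05]
to_annotate = select_prc(pool_scores, k=2)
```

In an active-learning loop, the selected indices would be annotated, added to the training set, and the model retrained before the next round of scoring.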