Although transformer-based systems achieve higher accuracy with fewer training examples, data acquisition remains an obstacle for rare-class tasks, in which the class label is very infrequent (e.g., < 5% of samples). Active learning has been proposed to alleviate such challenges, but the choice of selection strategy, i.e., the criteria by which rare-class examples are chosen, has not been systematically evaluated. Further, transformers enable iterative transfer-learning approaches. We propose and investigate transfer- and active-learning solutions to the rare-class problem of dissonance detection by utilizing models trained on closely related tasks and by evaluating acquisition strategies, including a proposed probability-of-rare-class (PRC) approach. We perform these experiments for a specific rare-class problem: collecting language samples of cognitive dissonance from social media. We find that PRC is a simple and effective strategy for guiding annotations and ultimately improving model accuracy, while transfer learning in a specific order can improve the cold-start performance of the learner but does not benefit the iterations of active learning.
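As a rough illustration of the acquisition step, the sketch below assumes that a probability-of-rare-class strategy ranks the unlabeled pool by the current model's predicted probability of the rare class and sends the top-scoring samples for annotation. The abstract does not spell out the exact rule, so this reading, the function name `select_by_prc`, and the example probabilities are all hypothetical.

```python
from typing import List, Sequence


def select_by_prc(rare_class_probs: Sequence[float], batch_size: int) -> List[int]:
    """Pick pool indices with the highest predicted rare-class probability.

    Assumption: PRC is read here as "annotate the samples the current model
    considers most likely to belong to the rare class".
    """
    if batch_size < 0:
        raise ValueError("batch_size must be non-negative")
    # Rank pool indices from most to least likely rare-class.
    # sorted() is stable, so ties keep their original pool order.
    ranked = sorted(
        range(len(rare_class_probs)),
        key=lambda i: rare_class_probs[i],
        reverse=True,
    )
    return ranked[:batch_size]


# Hypothetical model outputs for five unlabeled samples.
pool_probs = [0.02, 0.40, 0.07, 0.91, 0.15]
picked = select_by_prc(pool_probs, batch_size=2)
print(picked)  # [3, 1]
```

In an active-learning loop, the selected indices would be annotated, added to the training set, and the model retrained before the next round of selection.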