Semi-supervised learning (SSL) is a popular setting that aims to use unlabelled data effectively to improve model performance on downstream natural language processing (NLP) tasks. Currently, there are two popular approaches for making use of unlabelled data: self-training (ST) and task-adaptive pre-training (TAPT). ST uses a teacher model to assign pseudo-labels to the unlabelled data, while TAPT continues pre-training on the unlabelled data before fine-tuning. To the best of our knowledge, the effectiveness of TAPT in SSL tasks has not been systematically studied, and no previous work has directly compared TAPT and ST in terms of their ability to utilize the pool of unlabelled data. In this paper, we provide an extensive empirical study comparing five state-of-the-art ST approaches and TAPT across various NLP tasks and data sizes, including in- and out-of-domain settings. Surprisingly, we find that TAPT is a strong SSL learner that is more robust than more sophisticated ST approaches, even when using just a few hundred unlabelled samples or in the presence of domain shifts, and that it tends to bring greater improvements in SSL than in fully-supervised settings. Our further analysis demonstrates the risks of using ST approaches when the amount of labelled or unlabelled data is small or when domain shifts exist. We offer a fresh perspective for future SSL research, suggesting the use of unsupervised pre-training objectives rather than reliance on pseudo-labels.