We propose a semi-supervised text classifier based on self-training that exploits one positive and one negative property of neural networks. One weakness of self-training is the semantic drift problem, where noisy pseudo-labels accumulate over iterations and the error rate consequently soars. To tackle this challenge, we reshape the role of pseudo-labels and create a hierarchical order of information. In addition, a crucial step in self-training is to use the classifier's confidence predictions to select the best candidate pseudo-labels. Neural networks cannot do this efficiently, because their outputs are known to be poorly calibrated. To overcome this challenge, we propose a hybrid metric to replace the plain confidence measurement. Our metric takes the prediction uncertainty into account via a subsampling technique. We evaluate our model on five standard benchmarks and show that it significantly outperforms ten diverse baseline models. Furthermore, we show that the improvement achieved by our model is additive to language model pretraining, a widely used technique for exploiting unlabeled documents. Our code is available at https://github.com/p-karisani/RST.
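The abstract does not define the hybrid metric, so the following is only a generic sketch of the underlying idea, not the authors' method: score each unlabeled document by its average confidence across several classifiers trained on different subsamples, penalized by how much those classifiers disagree. The function name `select_pseudo_labels`, the `penalty` weight, and the use of standard deviation as the uncertainty term are all assumptions made for illustration.

```python
from statistics import fmean, pstdev


def select_pseudo_labels(prob_runs, k, penalty=1.0):
    """Pick the k most trustworthy pseudo-label candidates.

    prob_runs[r][d][c] is the probability that run r (a classifier trained on
    one subsample of the labeled data) assigns class c to unlabeled document d.
    Hypothetical scoring rule: mean confidence minus penalty * disagreement.
    Returns a list of (document_index, predicted_label) pairs, best first.
    """
    n_docs = len(prob_runs[0])
    n_classes = len(prob_runs[0][0])
    scored = []
    for d in range(n_docs):
        # Average the class probabilities over all subsample runs.
        mean_probs = [
            fmean([run[d][c] for run in prob_runs]) for c in range(n_classes)
        ]
        label = max(range(n_classes), key=mean_probs.__getitem__)
        # Uncertainty: spread of the chosen class's probability across runs.
        spread = pstdev([run[d][label] for run in prob_runs])
        scored.append((mean_probs[label] - penalty * spread, d, label))
    # Highest score first; Python's sort is stable, so ties keep document order.
    scored.sort(key=lambda item: -item[0])
    return [(d, label) for _, d, label in scored[:k]]


# Three runs, three documents, two classes (made-up numbers).
runs = [
    [[0.90, 0.10], [0.95, 0.05], [0.20, 0.80]],
    [[0.90, 0.10], [0.55, 0.45], [0.30, 0.70]],
    [[0.90, 0.10], [0.99, 0.01], [0.25, 0.75]],
]
selected = select_pseudo_labels(runs, k=2)
```

In this toy input, the second document has a higher average confidence than the third, but the runs disagree about it far more, so the penalized score ranks it lower. Setting `penalty=0` recovers plain confidence ranking.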