Semi-supervised text classification (SSTC) paradigms typically follow the spirit of self-training. The key idea is to train a deep classifier on a limited set of labeled texts and then iteratively predict pseudo-labels for the unlabeled texts, which are used for further training. However, performance is largely determined by the accuracy of the pseudo-labels, which may not be high in real-world scenarios. This paper presents a Rank-aware Negative Training (RNT) framework that addresses SSTC in a learning-with-noisy-labels manner. To alleviate the noisy information, we adapt a reasoning-with-uncertainty approach to rank the unlabeled texts based on the evidential support they receive from the labeled texts. Moreover, we propose the use of negative training to train RNT, based on the concept that "the input instance does not belong to the complementary label". A complementary label is randomly selected from all labels except the on-target label. Intuitively, the probability of the true label serving as a complementary label is low, so negative training introduces less noisy information, resulting in better performance on the test data. Finally, we evaluate the proposed solution on various text classification benchmark datasets. Our extensive experiments show that it outperforms the state-of-the-art alternatives in most scenarios and achieves competitive performance in the others. The code of RNT is publicly available at: https://github.com/amurtadha/RNT.
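The negative-training idea described above can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository for that); it assumes a common formulation of the negative loss, -log(1 - p_c), where p_c is the softmax probability the classifier assigns to the complementary label c. The function names, the example logits, and the uniform sampling scheme are all illustrative assumptions. Under uniform sampling over K classes, a correct pseudo-label never yields the true label as the complementary label, and an incorrect one does so with probability 1/(K-1), which is the intuition behind the "less noisy information" claim.

```python
import math
import random


def sample_complementary_label(given_label, num_classes, rng):
    """Pick a label uniformly at random from all labels except given_label.

    Hypothetical helper: uniform sampling is an assumption of this sketch.
    """
    candidates = [c for c in range(num_classes) if c != given_label]
    return rng.choice(candidates)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def negative_training_loss(logits, complementary_label, eps=1e-12):
    """Assumed negative loss: -log(1 - p[complementary_label]).

    Minimizing it pushes probability mass away from the complementary
    label, i.e. "the input does not belong to this label".
    """
    probs = softmax(logits)
    return -math.log(max(1.0 - probs[complementary_label], eps))


# Illustrative values only: 4 classes, pseudo-label 0 for one unlabeled text.
rng = random.Random(0)
logits = [2.0, 0.5, -1.0, 0.1]
pseudo_label = 0
comp = sample_complementary_label(pseudo_label, num_classes=4, rng=rng)
loss = negative_training_loss(logits, comp)
```

In this sketch the loss is small when the classifier already assigns little probability to the complementary label, and grows as that probability approaches 1.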