Large training datasets improve the generalization of neural networks. Semi-supervised learning (SSL) is useful when labeled data are scarce and unlabeled data are plentiful. SSL methods that rely on data augmentation are the most successful on image datasets. Text, in contrast, lacks augmentation methods as consistent as those available for images, so augmentation-based methods are less effective on text than on images. In this study, we compared SSL algorithms that do not require augmentation: self-training, co-training, tri-training, and tri-training with disagreement. In the experiments, we used four text datasets covering different tasks. We examined the algorithms from several perspectives, organized around a set of experimental questions, and suggested several improvements. Among the algorithms, tri-training with disagreement came closest to the performance of the Oracle; however, the remaining performance gap shows that new semi-supervised algorithms or improvements to existing methods are needed.