Large training datasets improve the generalization of neural networks. Semi-supervised learning (SSL) is useful when labeled data are scarce and unlabeled data are plentiful. SSL methods that rely on data augmentation are the most successful on image datasets. Text, in contrast, lacks augmentation methods as consistent as those available for images; consequently, augmentation-based methods are less effective on text data than on image data. In this study, we compared SSL algorithms that do not require augmentation: self-training, co-training, tri-training, and tri-training with disagreement. In the experiments, we used four text datasets covering different tasks. We examined the algorithms from several perspectives by posing a set of experimental questions, and we suggested several improvements. Among the algorithms, tri-training with disagreement performed closest to the Oracle; however, the remaining performance gap shows that new semi-supervised algorithms or improvements to existing methods are needed.
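For readers unfamiliar with the augmentation-free family compared here, self-training is the simplest member: a model trained on the labeled set assigns pseudo-labels to the unlabeled examples it is confident about, absorbs them, and is retrained. The sketch below is a generic illustration of that loop, not the implementation used in this study. The classifier `TinyNaiveBayes`, the function `self_train`, and the parameters `threshold` and `max_rounds` are hypothetical names chosen for the example, and the toy naive Bayes model merely stands in for whatever text classifier is actually used.

```python
import math
from collections import Counter


class TinyNaiveBayes:
    """Toy multinomial naive Bayes over whitespace tokens (illustrative only)."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def predict_proba(self, text):
        """Return a dict mapping each class to its posterior probability."""
        total_docs = sum(self.class_counts.values())
        log_scores = {}
        for c in self.classes:
            n_words = sum(self.word_counts[c].values())
            score = math.log(self.class_counts[c] / total_docs)
            for w in text.lower().split():
                if w in self.vocab:  # words never seen in training are ignored
                    # Add-one (Laplace) smoothing.
                    score += math.log(
                        (self.word_counts[c][w] + 1) / (n_words + len(self.vocab))
                    )
            log_scores[c] = score
        # Normalize log scores into probabilities in a numerically stable way.
        top = max(log_scores.values())
        exps = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exps.values())
        return {c: v / norm for c, v in exps.items()}


def self_train(labeled, unlabeled, threshold=0.75, max_rounds=5):
    """Generic self-training loop.

    labeled:   list of (text, label) pairs
    unlabeled: list of texts
    Returns the final model and the texts that were never pseudo-labeled.
    """
    texts = [t for t, _ in labeled]
    labels = [y for _, y in labeled]
    pool = list(unlabeled)
    model = TinyNaiveBayes().fit(texts, labels)

    for _ in range(max_rounds):
        remaining, added = [], 0
        for text in pool:
            probs = model.predict_proba(text)
            label, confidence = max(probs.items(), key=lambda kv: kv[1])
            if confidence >= threshold:
                # Confident prediction: treat it as a pseudo-label.
                texts.append(text)
                labels.append(label)
                added += 1
            else:
                remaining.append(text)
        if added == 0:
            break  # nothing new was learned, so stop early
        pool = remaining
        model = TinyNaiveBayes().fit(texts, labels)  # retrain on the enlarged set

    return model, pool


if __name__ == "__main__":
    labeled = [("great fun movie", "pos"), ("boring dull movie", "neg")]
    unlabeled = ["great fun acting", "dull boring plot", "movie"]
    model, leftover = self_train(labeled, unlabeled)
    print(model.predict_proba("great fun"))
    print("never pseudo-labeled:", leftover)
```

Co-training and tri-training follow the same pattern but use two or three models that label data for one another; in tri-training with disagreement, an example is passed to a model only when the other two agree on its label and that model disagrees with them.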