Training deep neural networks (DNNs) with limited supervision is a popular research topic because it can significantly reduce the annotation burden. Self-training has been applied successfully to semi-supervised learning tasks, but it is vulnerable to label noise from incorrect pseudo labels. Inspired by the observation that samples with similar labels tend to share similar representations, we develop a neighborhood-based sample selection approach to tackle the issue of noisy pseudo labels. We further stabilize self-training by aggregating the predictions from different rounds during sample selection. Experiments on eight tasks show that our proposed method outperforms the strongest self-training baseline by 1.83% and 2.51% on average on text and graph datasets, respectively. Further analysis demonstrates that our data selection strategy reduces the noise of pseudo labels by 36.8% and saves 57.3% of the time compared with the best baseline. Our code and appendices will be uploaded to https://github.com/ritaranx/NeST.
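To make the two ideas in the abstract concrete, the following is a minimal sketch of neighborhood-based pseudo-label selection combined with aggregation of predictions across rounds. It is not the authors' implementation: the abstract does not specify the distance metric, the scoring function, or the aggregation rule, so the Euclidean k-nearest-neighbor search, the cross-entropy score, the exponential moving average, and all function and parameter names (`knn_indices`, `neighborhood_score`, `aggregate`, `select_samples`, `k`, `budget`, `momentum`) are assumptions chosen for illustration.

```python
import math


def knn_indices(query, bank, k):
    """Indices of the k vectors in `bank` closest to `query` (Euclidean distance)."""
    dists = [math.dist(query, b) for b in bank]
    return sorted(range(len(bank)), key=lambda i: dists[i])[:k]


def neighborhood_score(probs, neighbor_labels, num_classes, eps=1e-8):
    """Cross-entropy between the neighbors' label histogram and the prediction.

    Assumed scoring rule: a lower score means the prediction agrees
    more closely with the labels found in the sample's neighborhood.
    """
    hist = [0.0] * num_classes
    for y in neighbor_labels:
        hist[y] += 1.0 / len(neighbor_labels)
    return -sum(h * math.log(p + eps) for h, p in zip(hist, probs))


def aggregate(prev_probs, new_probs, momentum=0.5):
    """Assumed aggregation rule: moving average of predictions across rounds."""
    return [momentum * p + (1 - momentum) * q for p, q in zip(prev_probs, new_probs)]


def select_samples(unl_emb, unl_probs, lab_emb, lab_y, num_classes, k=3, budget=2):
    """Return (index, pseudo_label) for the `budget` most neighborhood-consistent samples."""
    scores = []
    for emb, probs in zip(unl_emb, unl_probs):
        nbrs = knn_indices(emb, lab_emb, k)
        scores.append(neighborhood_score(probs, [lab_y[j] for j in nbrs], num_classes))
    ranked = sorted(range(len(unl_emb)), key=lambda i: scores[i])
    return [
        (i, max(range(num_classes), key=lambda c: unl_probs[i][c]))
        for i in ranked[:budget]
    ]


# Toy data: two labeled clusters and three unlabeled points.
lab_emb = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
lab_y = [0, 0, 0, 1, 1, 1]
unl_emb = [(0.1, 0.1), (5.1, 5.0), (4.9, 5.1)]
unl_probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]]

# The second point predicts class 0 but sits in the class-1 cluster, so it is skipped.
print(select_samples(unl_emb, unl_probs, lab_emb, lab_y, num_classes=2))
# -> [(0, 0), (2, 1)]
```

In a full self-training loop, one would presumably call `aggregate` on each unlabeled sample's predictions after every round and pass the aggregated probabilities to `select_samples`, so that a single unstable round does not dominate the selection.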