Semi-Supervised Text Classification (SSTC) methods mainly follow the spirit of self-training. They initialize a deep classifier by training on labeled texts, and then alternate between predicting pseudo-labels for unlabeled texts and training the deep classifier on the mixture of labeled and pseudo-labeled texts. Naturally, their performance is largely affected by the accuracy of the pseudo-labels for unlabeled texts. Unfortunately, they often suffer from low accuracy because of the margin bias problem, which is caused by the large differences between the representation distributions of labels in SSTC. To alleviate this problem, we apply the angular margin loss and perform several Gaussian linear transformations to achieve balanced label angle variances, where the label angle variance of a label is the variance of the label angles of the texts within that label. Constraining all label angle variances to be balanced yields more accurate pseudo-labels; the variances are estimated over both labeled and pseudo-labeled texts during the self-training loops. With this insight, we propose a novel SSTC method, namely Semi-Supervised Text Classification with Balanced Deep representation Distributions (S2TC-BDD). We implement both multi-class and multi-label classification versions of S2TC-BDD by introducing some pseudo-labeling tricks and regularization terms. To evaluate S2TC-BDD, we compare it against state-of-the-art SSTC methods. Empirical results demonstrate the effectiveness of S2TC-BDD, especially when labeled texts are scarce.
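The generic self-training loop described in the abstract, together with a per-label angle variance statistic, can be sketched as follows. This is a minimal illustration and not the S2TC-BDD method itself: the deep classifier is replaced by a hypothetical cosine-prototype classifier, the "texts" are synthetic 2-D unit vectors, the label angle is assumed to be the angle between a text representation and its label prototype, and every function name here is invented for the example.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def angle(u, v):
    """Angle in radians between two unit vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))


def fit_prototypes(feats, labels, num_classes):
    """Stand-in for training: one unit-length mean vector per label."""
    dim = len(feats[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    for f, y in zip(feats, labels):
        for i, x in enumerate(f):
            sums[y][i] += x
    return [normalize(s) for s in sums]


def predict(protos, f):
    """Pseudo-label: the label whose prototype has the smallest angle to f."""
    angles = [angle(p, f) for p in protos]
    return min(range(len(protos)), key=angles.__getitem__)


def self_train(lab_x, lab_y, unlab_x, num_classes, loops=5):
    """Initialize on labeled texts, then alternate pseudo-labeling and retraining."""
    protos = fit_prototypes(lab_x, lab_y, num_classes)
    pseudo = []
    for _ in range(loops):
        pseudo = [predict(protos, f) for f in unlab_x]
        protos = fit_prototypes(lab_x + unlab_x, lab_y + pseudo, num_classes)
    return protos, pseudo


def label_angle_variances(protos, feats, labels):
    """Assumed definition: per-label variance of the text-to-prototype angle."""
    out = []
    for k, p in enumerate(protos):
        a = [angle(p, f) for f, y in zip(feats, labels) if y == k]
        if not a:  # no text carries this label
            out.append(0.0)
            continue
        mean = sum(a) / len(a)
        out.append(sum((x - mean) ** 2 for x in a) / len(a))
    return out


def make_texts(rng, center_deg, spread_deg, n):
    """Synthetic unit-length 2-D 'text representations' around a direction."""
    out = []
    for _ in range(n):
        t = math.radians(rng.gauss(center_deg, spread_deg))
        out.append([math.cos(t), math.sin(t)])
    return out


rng = random.Random(0)
# Label 0 is generated with a narrow angular spread, label 1 with a wide one.
lab_x = make_texts(rng, 0, 5, 3) + make_texts(rng, 90, 25, 3)
lab_y = [0, 0, 0, 1, 1, 1]
unlab_x = make_texts(rng, 0, 5, 50) + make_texts(rng, 90, 25, 50)

protos, pseudo = self_train(lab_x, lab_y, unlab_x, num_classes=2)
variances = label_angle_variances(protos, lab_x + unlab_x, lab_y + pseudo)
print("label angle variances (rad^2):", variances)
```

The variances are estimated over both labeled and pseudo-labeled texts, mirroring the estimation step mentioned in the abstract. The balancing step itself (the angular margin loss and the Gaussian linear transformations) is not reproduced here.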