Text classification is a fundamental task in natural language processing. Compared with the earlier paradigm of pre-training followed by fine-tuning with a cross-entropy loss, the recently proposed supervised contrastive learning approach has received considerable attention for its strong feature learning capability and robustness. Although several studies have applied this technique to text classification, some limitations remain. First, many text datasets are imbalanced, and supervised contrastive learning is sensitive to data imbalance, which may harm model performance. Second, existing models use a separate cross-entropy classification branch and a supervised contrastive learning branch, with no explicit mutual guidance between the two. To address these limitations, we propose a novel model named SharpReCL for imbalanced text classification. First, we obtain a prototype vector for each class in the balanced classification branch to serve as that class's representation. Then, by explicitly leveraging the prototype vectors, we construct for each class a target sample set that is adequate and of equal size across classes, on which supervised contrastive learning is performed. Empirical results show the effectiveness of our model, which even outperforms popular large language models on several datasets.
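To make the two steps named in the abstract more concrete, the following is a minimal sketch of per-class prototypes and equal-size target sets. It is not the SharpReCL implementation: the abstract does not specify how prototypes are computed or how target sets are filled, so this sketch assumes a prototype is the mean embedding of a class, and that each target set holds the prototype plus the class samples nearest to it, padded with copies of the prototype when a class is too small. The function names `class_prototypes` and `balanced_target_sets` and the parameter `size` are hypothetical.

```python
import math
from collections import defaultdict


def class_prototypes(embeddings, labels):
    """Return one prototype per class.

    Assumption: a prototype is the mean of the class's embeddings.
    """
    sums = {}
    counts = defaultdict(int)
    for vec, lab in zip(embeddings, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
        sums[lab] = [s + v for s, v in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}


def balanced_target_sets(embeddings, labels, prototypes, size):
    """Build a target set of exactly `size` vectors for every class.

    Assumption: each set is the prototype, then the class samples closest
    to it, then copies of the prototype if the class has too few samples.
    """
    by_class = defaultdict(list)
    for vec, lab in zip(embeddings, labels):
        by_class[lab].append(list(vec))

    targets = {}
    for lab, proto in prototypes.items():
        # Sort this class's samples by Euclidean distance to its prototype.
        nearest = sorted(by_class[lab], key=lambda v: math.dist(v, proto))
        chosen = [list(proto)] + nearest[: size - 1]
        # Pad minority classes so every class ends up with the same set size.
        chosen += [list(proto) for _ in range(size - len(chosen))]
        targets[lab] = chosen
    return targets


if __name__ == "__main__":
    # Toy imbalanced data: three samples of class "a", one of class "b".
    embeddings = [[0.0, 0.0], [2.0, 0.0], [4.0, 0.0], [1.0, 1.0]]
    labels = ["a", "a", "a", "b"]
    protos = class_prototypes(embeddings, labels)
    for lab, vectors in balanced_target_sets(embeddings, labels, protos, 3).items():
        print(lab, vectors)
```

Under these assumptions, the minority class contributes as many targets to the contrastive objective as the majority class, which is the balancing effect the abstract describes; a real system would use learned embeddings and likely a more careful way to fill small classes.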