Text classification is a fundamental task in web content mining. Compared with the earlier paradigm of pre-training followed by fine-tuning with a cross-entropy loss, the recently proposed supervised contrastive learning approach has received tremendous attention for its powerful feature learning capability and robustness. Although several studies have applied this technique to text classification, some limitations remain. First, many text datasets are imbalanced, and the learning mechanism of supervised contrastive learning is sensitive to data imbalance, which may harm the model's performance. Second, existing models use a classification branch trained with cross-entropy and a separate supervised contrastive learning branch, with no explicit mutual guidance between the two. To address these limitations, we propose a novel model named SharpReCL for imbalanced text classification tasks. First, we obtain a prototype vector for each class in the balanced classification branch to serve as a representation of that class. Then, by explicitly leveraging the prototype vectors, we construct a proper and sufficient target sample set of the same size for each class, on which the supervised contrastive learning procedure is performed. The empirical results show the effectiveness of our model, which even outperforms popular large language models on several datasets. Our code is publicly available.
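The two steps named in the abstract (a prototype vector per class, then a target sample set of equal size per class) can be illustrated with a minimal sketch. This is not the SharpReCL implementation: the abstract does not say how prototypes are computed or how the sets are filled, so the sketch assumes a prototype is the mean embedding of a class and that small classes are padded by resampling with replacement. The function names `class_prototypes` and `balanced_target_sets` and the `set_size` parameter are hypothetical.

```python
import random
from collections import defaultdict


def class_prototypes(embeddings, labels):
    """Return one prototype per class (assumption: the mean embedding)."""
    groups = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        groups[label].append(vec)
    return {
        label: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for label, vecs in groups.items()
    }


def balanced_target_sets(embeddings, labels, set_size, seed=0):
    """Build a target set of exactly `set_size` vectors for every class.

    Assumption: each set holds the class prototype plus `set_size - 1`
    samples of that class, drawn with replacement when the class is small.
    """
    if set_size < 1:
        raise ValueError("set_size must be at least 1")
    rng = random.Random(seed)
    groups = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        groups[label].append(vec)
    prototypes = class_prototypes(embeddings, labels)
    targets = {}
    need = set_size - 1
    for label, vecs in groups.items():
        if len(vecs) >= need:
            picked = rng.sample(vecs, need)  # enough samples: no repeats
        else:
            picked = vecs + rng.choices(vecs, k=need - len(vecs))  # pad by resampling
        targets[label] = [prototypes[label]] + picked
    return targets


if __name__ == "__main__":
    # Toy imbalanced data: three samples of class "a", one of class "b".
    embeddings = [[0.0, 0.0], [2.0, 4.0], [4.0, 2.0], [9.0, 9.0]]
    labels = ["a", "a", "a", "b"]
    for label, vecs in balanced_target_sets(embeddings, labels, set_size=3).items():
        print(label, len(vecs))
```

With equal-sized sets, a minority class contributes as many contrastive targets as a majority class, which is the balancing effect the abstract describes. A real model would likely compute prototypes from learned encoder outputs and might fill the sets differently.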