This paper introduces a novel semi-supervised learning framework for text classification that addresses the challenge of large datasets with few labeled examples. By integrating multi-level, similarity-based data augmentation techniques, ranging from Retrieval-Augmented Generation (RAG) and Large Language Model (LLM) rewriting to traditional word substitution, we construct an intelligent augmentation pipeline. The framework selects representative landmarks through clustering; these landmarks serve as intermediaries in the retrieval and rewriting processes, ensuring that the augmented data maintains a distribution similar to that of the original dataset. Empirical results show that even in complex document classification scenarios with over 100 categories, our method achieves state-of-the-art accuracies of 95.41% and 82.43% on the Reuters and Web of Science datasets, respectively. These findings highlight the effectiveness and broad applicability of our semi-supervised learning approach for text classification tasks.
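The landmark selection step can be illustrated with a minimal sketch. The abstract does not specify the text representation or the clustering algorithm, so everything here is an assumption made for illustration: bag-of-words term-frequency vectors stand in for a real embedding, a small cosine-similarity k-means stands in for the clustering method, and the function names (`tf_vector`, `select_landmarks`) are hypothetical. Each landmark is the document closest to its cluster centroid.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (assumed stand-in for an embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)  # Counter.update adds counts
    return Counter({t: c / len(vectors) for t, c in total.items()})


def select_landmarks(texts, k, iterations=10):
    """Return sorted indices of one representative document per cluster.

    Assumes 1 <= k <= len(texts). Illustrative only, not the paper's method.
    """
    vectors = [tf_vector(t) for t in texts]

    # Deterministic farthest-point initialization: start from the first
    # document, then repeatedly add the one least similar to chosen centroids.
    centroids = [vectors[0]]
    while len(centroids) < k:
        farthest = min(
            range(len(vectors)),
            key=lambda i: max(cosine(vectors[i], c) for c in centroids),
        )
        centroids.append(vectors[farthest])

    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each document joins its most similar centroid.
        clusters = [[] for _ in range(k)]
        for i, v in enumerate(vectors):
            best = max(range(k), key=lambda j: cosine(v, centroids[j]))
            clusters[best].append(i)
        # Update step: recompute centroids, keeping the old one if a cluster is empty.
        centroids = [
            mean_vector([vectors[i] for i in members]) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]

    # The landmark of a cluster is the member most similar to its centroid.
    landmarks = [
        max(members, key=lambda i: cosine(vectors[i], centroids[j]))
        for j, members in enumerate(clusters)
        if members
    ]
    return sorted(landmarks)
```

In a pipeline like the one described, the returned indices would identify the documents passed on to the retrieval and rewriting stages; choosing actual documents rather than centroids keeps the landmarks as real text that can be retrieved against or rewritten.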