In text classification tasks, fine-tuning pretrained language models such as BERT and GPT-3 yields competitive accuracy; however, both models require pretraining on large text datasets. In contrast, general topic modeling methods can analyze documents and extract meaningful word patterns without any pretraining. To leverage topic modeling's unsupervised insight extraction for text classification, we develop Knowledge Distillation Semi-supervised Topic Modeling (KDSTM). KDSTM requires no pretrained embeddings and only a few labeled documents, and it is efficient to train, making it well suited to resource-constrained settings. Across a variety of datasets, our method outperforms existing supervised topic modeling methods in classification accuracy, robustness, and efficiency, and achieves performance comparable to state-of-the-art weakly supervised text classification methods.