This paper presents CAALM-TC (Combining Autoregressive and Autoencoder Language Models for Text Classification), a novel method that enhances text classification by integrating autoregressive and autoencoder language models. Autoregressive large language models such as OpenAI's GPT, Meta's Llama, or Microsoft's Phi offer promising prospects for content analysis practitioners, but they generally underperform supervised BERT-based models on text classification. CAALM uses an autoregressive model to generate contextual information about each input text, which is then combined with the original text and fed into an autoencoder model for classification. This hybrid approach capitalizes on the extensive contextual knowledge of autoregressive models and the efficient classification capabilities of autoencoders. Experimental results on four benchmark datasets show that CAALM consistently outperforms existing methods, particularly on tasks with smaller training sets and more abstract classification objectives. The findings indicate that CAALM offers a scalable and effective solution for automated content analysis in social science research that minimizes sample size requirements.
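The pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate_context` and `classify` are hypothetical stand-ins for, respectively, a prompted autoregressive LLM (e.g. GPT, Llama, or Phi) and a fine-tuned BERT-style classifier, and the `[SEP]`-style concatenation is an assumed joining scheme.

```python
def generate_context(text: str) -> str:
    """Stand-in for an autoregressive LLM prompted to describe the input.

    A real implementation would prompt the model with something like
    'Summarize the topic of the following text: ...' and return its output.
    """
    # Dummy heuristic in place of an actual LLM call.
    return f"Context: this text discusses {text.split()[0].lower()}."


def build_classifier_input(text: str, context: str, sep: str = " [SEP] ") -> str:
    """Concatenate the original text with the generated context so a
    BERT-style autoencoder can classify the enriched input."""
    return text + sep + context


def classify(enriched_text: str) -> int:
    """Stand-in for a supervised BERT-based classifier fine-tuned on the
    enriched (text + context) inputs; returns a dummy label here."""
    return int("Context:" in enriched_text)


text = "Climate policy debates intensified this year."
enriched = build_classifier_input(text, generate_context(text))
label = classify(enriched)
```

The key design point is that the autoencoder never sees the raw text alone: it is always trained and evaluated on the text enriched with LLM-generated context, which is what lets a small labeled sample benefit from the autoregressive model's background knowledge.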