Multiclass Classification of Policy Documents with Large Language Models

Classifying policy documents into policy issue topics has been a long-standing effort in political science and communication. Efforts to automate text classification for social science research have achieved remarkable results, but there is still considerable room for progress. In this work, we test the predictive performance of an alternative strategy that requires far less human involvement than full manual coding. We use OpenAI's GPT 3.5 and GPT 4 models, which are pre-trained, instruction-tuned Large Language Models (LLMs), to classify congressional bills and congressional hearings into the Comparative Agendas Project's 21 major policy issue topics. We propose three use-case scenarios and estimate overall accuracies ranging from 58% to 83%, depending on the scenario and the GPT model employed. The three scenarios aim at minimal, moderate, and major human intervention, respectively. Overall, our results point to the insufficiency of complete reliance on GPT with minimal human intervention, an increase in accuracy with the human effort exerted, and a surprisingly high accuracy in the most human-demanding use case. However, the best-performing use case achieved its 83% accuracy on the 65% of the data on which the two models agreed, suggesting that an approach similar to ours can be implemented relatively easily and allow mostly automated coding of the majority of a given dataset. This could free up resources for manual human coding of the remaining 35% of the data, achieving a higher overall level of accuracy while reducing costs significantly.
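
The closing argument can be illustrated with a minimal sketch: keep a label when two models agree, send disagreements to human coders, and compute the resulting blended accuracy. The function names `route_by_agreement` and `blended_accuracy` are hypothetical, and the manual-coding accuracy is an assumption not reported in the abstract. Only the 65% agreement share and the 83% accuracy come from the text. This is not necessarily the exact procedure used in the study.

```python
from typing import List, Optional, Sequence


def route_by_agreement(
    labels_a: Sequence[int], labels_b: Sequence[int]
) -> List[Optional[int]]:
    """Keep a label where both models agree; mark disagreements as None.

    None means "send this document to a human coder" (hypothetical convention).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both models must label the same documents")
    return [a if a == b else None for a, b in zip(labels_a, labels_b)]


def blended_accuracy(
    agree_share: float, agree_accuracy: float, manual_accuracy: float
) -> float:
    """Overall accuracy when agreed items are auto-coded and the rest hand-coded."""
    return agree_share * agree_accuracy + (1.0 - agree_share) * manual_accuracy


if __name__ == "__main__":
    # Toy topic codes for five documents (made-up values, for illustration only).
    model_a = [1, 3, 20, 7, 12]
    model_b = [1, 3, 19, 7, 15]
    print(route_by_agreement(model_a, model_b))

    # 0.65 and 0.83 are from the abstract; manual_accuracy=1.0 is an
    # ASSUMED upper bound (perfect human coding), not a reported figure.
    print(f"{blended_accuracy(0.65, 0.83, 1.0):.2f}")
```

Under the assumed perfect manual coding, the blended figure is 0.65 × 0.83 + 0.35 × 1.0, or roughly 0.89. Any lower assumed manual accuracy reduces it proportionally.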
