Scalable Qualitative Coding with LLMs: Chain-of-Thought Reasoning Matches Human Performance in Some Hermeneutic Tasks

Qualitative coding, or content analysis, extracts meaning from text to discern quantitative patterns across a corpus of texts. Recent advances in the interpretive abilities of large language models (LLMs) offer potential for automating the coding process (applying category labels to texts), freeing human researchers to concentrate on more creative aspects of research while delegating these interpretive tasks to AI. Our case study applies a set of socio-historical codes to dense, paragraph-long passages representative of a humanistic study. We show that GPT-4 is capable of human-equivalent interpretations, whereas GPT-3.5 is not. Compared to our human-derived gold standard, GPT-4 delivers excellent intercoder reliability (Cohen's $\kappa \geq 0.79$) for 3 of 9 codes, and substantial reliability ($\kappa \geq 0.6$) for 8 of 9 codes. In contrast, GPT-3.5 greatly underperforms for all codes ($\mathrm{mean}(\kappa) = 0.34$; $\max(\kappa) = 0.55$). Importantly, we find that coding fidelity improves considerably when the LLM is prompted to give a rationale justifying its coding decisions (chain-of-thought reasoning). We present these and other findings along with a set of best practices for adapting traditional codebooks for LLMs. Our results indicate that for certain codebooks, state-of-the-art LLMs are already adept at large-scale content analysis. Furthermore, they suggest that the next generation of models will likely render AI coding a viable option for a majority of codebooks.
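For readers unfamiliar with the reliability metric reported above, the following is a minimal sketch of how Cohen's $\kappa$ can be computed for a single binary code, comparing human gold-standard labels against model labels. The function name `cohen_kappa` and the toy label lists are illustrative assumptions and are not data or code from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two raters' marginal proportions,
    # summed over labels (a Counter returns 0 for labels it has not seen).
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined (0/0), so treat it as perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy labels for one code (1 = code applies, 0 = it does not).
human_gold = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
model_pred = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]

kappa = cohen_kappa(human_gold, model_pred)
print(f"kappa = {kappa:.3f}")
```

In this toy case the raters agree on 8 of 10 passages (observed agreement 0.80) while chance agreement is 0.52, giving $\kappa \approx 0.58$, which falls just below the $\kappa \geq 0.6$ threshold described above as substantial reliability.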
