Large language models (LLMs) enable sophisticated qualitative coding of large datasets, but zero- and few-shot classifiers can produce an intolerable number of errors, even with careful, validated prompting. We present a simple, generalizable two-stage workflow: a first LLM applies a human-designed, LLM-adapted codebook; a secondary LLM critic then performs self-reflection on each positive label, re-reading the source text alongside the first model's rationale and issuing a final decision. We evaluate this approach on six qualitative codes over 3,000 high-content emails from Apache Software Foundation project evaluation discussions. A manual audit of 360 positive annotations (60 passages per code across six codes) found that the first-line LLM had false-positive rates of 8% to 54%, despite F1 scores of 0.74 to 1.00 in testing. Recoding all stage-one annotations with the second, self-reflection stage improved F1 by 0.04 to 0.25, raising the two poorest-performing codes from 0.52 and 0.55 to 0.69 and 0.79, respectively. Our manual evaluation identified two recurrent error classes: misinterpretation (violations of code definitions) and meta-discussion (debate about a project evaluation criterion mistaken for its use as a decision justification). Code-specific critic clauses targeting these observed failure modes were especially effective after testing and refinement, replicating for the critic the stage-one process of adapting the codebook for LLM interpretation. We explain how favoring recall in first-line LLM annotation, combined with a secondary critique, delivers precision-first, compute-light quality control. With human guidance and validation, self-reflection slots into existing LLM-assisted annotation pipelines to reduce noise and potentially salvage otherwise unusable classifiers.
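The two-stage annotate-then-critique workflow described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `call_llm` is a hypothetical stand-in for any chat-completion API, stubbed here with keyword rules so the control flow can be exercised, and the prompt formats and parsing are illustrative assumptions.

```python
def call_llm(prompt: str) -> str:
    # Placeholder: replace with a real model call. The stub labels
    # passages containing "because" as positive with a canned rationale,
    # and the critic rejects rationales that mention "debate"
    # (mimicking the meta-discussion failure mode).
    if prompt.startswith("ANNOTATE"):
        return "YES | speaker cites a criterion to justify a decision" \
            if "because" in prompt else "NO |"
    if prompt.startswith("CRITIQUE"):
        return "REJECT" if "debate" in prompt else "CONFIRM"
    return "NO |"

def annotate(passage: str, code_def: str) -> tuple[bool, str]:
    """Stage 1: recall-oriented first-line annotation with a rationale."""
    reply = call_llm(f"ANNOTATE code: {code_def}\npassage: {passage}")
    label, _, rationale = reply.partition("|")
    return label.strip() == "YES", rationale.strip()

def critique(passage: str, code_def: str, rationale: str) -> bool:
    """Stage 2: the critic re-reads the passage alongside the stage-one
    rationale and issues the final, precision-oriented decision."""
    reply = call_llm(
        f"CRITIQUE code: {code_def}\npassage: {passage}\nrationale: {rationale}"
    )
    return reply.strip() == "CONFIRM"

def two_stage_label(passage: str, code_def: str) -> bool:
    positive, rationale = annotate(passage, code_def)
    if not positive:
        return False  # only positive labels are re-examined by the critic
    return critique(passage, code_def, rationale)
```

Because the critic only runs on stage-one positives, the second pass adds compute in proportion to the (much smaller) positive set, which is what makes the control compute-light.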