Qualitative coding is central to social science, but expert annotation is difficult to scale. Large language models (LLMs) offer a possible extension, yet require careful validation when the target construct is interpretive, theoretically loaded, and only indirectly expressed. We study this problem in a difficult case: detecting whether authors treat Bayesian models as descriptions of mental and neural mechanisms (realism) or as useful mathematical tools (instrumentalism). Our method combines a theory-driven codebook, expert-coded reference annotations, a diagnostic-gated prompt-optimization search yielding a shared zero-shot prompt for three frontier LLMs (GPT-5.1, Claude Sonnet 4.6, Gemini 3 Pro Preview), and multi-rater reliability analysis. The final prompt achieved a held-out combined reliability score of 0.76 (harmonic mean of ICC = 0.79 and $\alpha$ = 0.74), with all diagnostics satisfied. Deployed on 6,858 quotes from 210 articles, the three LLMs reached substantial quote-level agreement (ICC = 0.80; $\alpha$ = 0.76; combined = 0.78) and near-perfect article-level rank stability ($r$ = 0.96 to 0.97 across rater pairs). The corpus was predominantly weakly realist, but article-level stances were rarely uniform: only 1.4% of articles used a single band, while 59.5% spanned four or more. Low-level perception/motor articles scored 8.8 Realism points higher than high-level cognition articles ($p < .001$, $d = 0.60$), quantifying a long-held qualitative intuition. We present this as an expert-led case study; the framework is intended to generalize to similar theoretically demanding tasks, not to all qualitative analysis.
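As a quick arithmetic check on the combined reliability scores reported above, the sketch below recomputes the harmonic mean of ICC and $\alpha$ for the held-out and deployment settings. It assumes the standard two-value harmonic mean, $2ab/(a+b)$; the function and variable names are illustrative and are not taken from the study's own code.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two positive scores: 2ab / (a + b)."""
    return 2 * a * b / (a + b)


# Reported reliability values (ICC, alpha) from the abstract.
held_out = harmonic_mean(0.79, 0.74)   # held-out evaluation
deployed = harmonic_mean(0.80, 0.76)   # deployment on the 6,858 quotes

print(round(held_out, 2))  # 0.76
print(round(deployed, 2))  # 0.78
```

Because the harmonic mean is pulled toward the lower of its two inputs, a high combined score requires both ICC and $\alpha$ to be high, which is presumably why it serves as the single summary metric here.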