Large language models (LLMs) offer strategy researchers powerful tools for annotating text at scale, but treating LLM-generated labels as deterministic overlooks substantial instability. Grounded in content analysis and generalizability theory, we diagnose five variance sources: construct specification, interface effects, model preferences, output extraction, and system-level aggregation. Empirical demonstrations show that minor design choices, such as prompt phrasing and model selection, can shift outcomes by 12 to 85 percentage points. Such variance threatens not only reproducibility but also econometric identification: annotation errors correlated with covariates bias parameter estimates even when average accuracy is high. We develop a variance-aware protocol specifying sampling budgets, aggregation rules, and reporting standards, and we delineate scope conditions under which LLM annotation should not be used. These contributions transform LLM-based annotation from ad hoc practice into auditable measurement infrastructure.