Large language models (LLMs) offer strategy researchers powerful tools for annotating text at scale, but treating LLM-generated labels as deterministic overlooks substantial instability. Grounded in content analysis and generalizability theory, we diagnose five variance sources: construct specification, interface effects, model preferences, output extraction, and system-level aggregation. Empirical demonstrations show that minor design choices, such as prompt phrasing and model selection, can shift outcomes by 12-85 percentage points. Such variance threatens not only reproducibility but also econometric identification: annotation errors correlated with covariates bias parameter estimates regardless of average accuracy. We develop a variance-aware protocol specifying sampling budgets, aggregation rules, and reporting standards, and we delineate scope conditions under which LLM annotation should not be used. These contributions transform LLM-based annotation from an ad hoc practice into auditable measurement infrastructure.
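The variance-aware protocol summarized above rests on repeated sampling and aggregation of LLM labels. A minimal illustrative sketch of one such aggregation rule, majority vote with a per-item agreement diagnostic, is given below; the function name, the label set, and the sampling budget are our own illustrative assumptions, not details taken from the paper.

```python
from collections import Counter

def aggregate_labels(draws):
    """Majority-vote aggregation over repeated label draws for one document.

    `draws` holds labels from k independent LLM annotation calls
    (e.g., under varied prompt phrasings or sampling seeds).
    Returns the modal label and the agreement rate (share of draws
    matching the mode), a simple per-item stability diagnostic.
    """
    counts = Counter(draws)
    label, n = counts.most_common(1)[0]
    return label, n / len(draws)

# Hypothetical draws for one document under a budget of k = 5 calls.
draws = ["positive", "positive", "negative", "positive", "negative"]
label, agreement = aggregate_labels(draws)  # label = "positive", agreement = 0.6
```

A low agreement rate flags items whose labels are unstable across repeated calls, which a variance-aware reporting standard would disclose rather than silently resolve.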