From Stochasticity to Signal: A Bayesian Latent State Model for Reliable Measurement with LLMs

Large Language Models (LLMs) are increasingly used to automate classification tasks in business, such as inferring customer satisfaction from text. However, LLM outputs are inherently stochastic, so treating a single output as a deterministic outcome introduces measurement error. In practice, this problem is often ignored by relying on a single round of output, or handled with ad hoc methods such as majority voting. Such naive approaches fail to quantify uncertainty and can produce biased estimates of population-level metrics. In this paper, we propose a formal statistical solution: a Bayesian latent state model. Our model treats the true classification as a latent variable and the multiple LLM ratings as noisy measurements of this latent state. The framework jointly estimates LLM error rates, population-level outcome rates, individual-level outcome probabilities, and, where applicable, the causal impact of interventions on the outcome. The methodology applies to both fully unsupervised settings, where ground truth labels are unavailable, and semi-supervised settings, where they are available for only a subset of the classification targets. We provide formal theoretical conditions and proofs for the strict identifiability of the model parameters. Through simulation studies, we demonstrate that the model accurately recovers the true parameters, showing superior performance and capabilities compared to other methods. We provide tailored recommendations for modeling choices based on the difficulty of the task. We also apply the model to a real-world case study analyzing over 14,000 customer support transcripts. We conclude that this methodology provides a general framework for converting probabilistic outputs from LLMs into reliable insights for scientific and business applications.
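To make the idea of "noisy measurements of a latent state" concrete, the following is a minimal sketch and not the authors' implementation. It assumes a binary outcome, a single LLM whose repeated ratings are conditionally independent given the true state, shared sensitivity and specificity across items, weak Beta priors, and no covariates or interventions. All function names, priors, and simulation settings are hypothetical choices made for illustration.

```python
import numpy as np

def simulate_ratings(n, k, prevalence, sens, spec, rng):
    """Draw latent true labels z and k noisy binary ratings per item."""
    z = rng.random(n) < prevalence
    p_one = np.where(z, sens, 1.0 - spec)            # P(rating = 1 | z)
    ratings = rng.random((n, k)) < p_one[:, None]
    return z.astype(int), ratings.astype(int)

def gibbs_latent_class(ratings, n_iter=1500, burn_in=500, seed=0):
    """Gibbs sampler for a two-state latent model with shared error rates.

    Assumed priors (illustrative only): prevalence ~ Beta(1, 1),
    sensitivity and specificity ~ Beta(2, 1).
    """
    rng = np.random.default_rng(seed)
    n, k = ratings.shape
    pos = ratings.sum(axis=1)                        # count of "1" ratings per item
    neg = k - pos
    pi, sens, spec = 0.5, 0.7, 0.7                   # start above 0.5 to fix label orientation
    draws = []
    item_prob_sum = np.zeros(n)
    for it in range(n_iter):
        # Step 1: sample each latent state given the current parameters.
        log_p1 = np.log(pi) + pos * np.log(sens) + neg * np.log1p(-sens)
        log_p0 = np.log1p(-pi) + pos * np.log1p(-spec) + neg * np.log(spec)
        p_z1 = 1.0 / (1.0 + np.exp(log_p0 - log_p1))
        z = rng.random(n) < p_z1
        # Step 2: sample parameters given latent states (conjugate Beta updates).
        n1 = z.sum()
        pi = rng.beta(1 + n1, 1 + n - n1)
        sens = rng.beta(2 + pos[z].sum(), 1 + neg[z].sum())
        spec = rng.beta(2 + neg[~z].sum(), 1 + pos[~z].sum())
        if it >= burn_in:
            draws.append((pi, sens, spec))
            item_prob_sum += p_z1
    draws = np.array(draws)
    return {
        "prevalence": draws[:, 0].mean(),
        "prevalence_95ci": np.percentile(draws[:, 0], [2.5, 97.5]),
        "sensitivity": draws[:, 1].mean(),
        "specificity": draws[:, 2].mean(),
        "item_prob": item_prob_sum / len(draws),     # per-item P(z = 1 | ratings)
    }

# Simulated example: 3000 items, 5 repeated ratings each (settings are arbitrary).
rng = np.random.default_rng(42)
true_z, ratings = simulate_ratings(n=3000, k=5, prevalence=0.2,
                                   sens=0.85, spec=0.85, rng=rng)
fit = gibbs_latent_class(ratings)
single_run_rate = ratings[:, 0].mean()               # naive estimate from one round of output
print("single-run rate:", round(single_run_rate, 3))
print("posterior mean prevalence:", round(fit["prevalence"], 3),
      "95% interval:", np.round(fit["prevalence_95ci"], 3))
print("posterior mean sensitivity / specificity:",
      round(fit["sensitivity"], 3), round(fit["specificity"], 3))
```

Under these simulated settings, the expected single-run positive rate is 0.2 × 0.85 + 0.8 × 0.15 = 0.29 rather than the true 0.2, which illustrates how unmodeled error rates can bias a population-level estimate. The sampler returns a posterior interval for the outcome rate and a per-item probability, two quantities that a single run or a majority vote does not provide. The start values above 0.5 and the Beta(2, 1) priors are one simple way to handle label switching in this sketch; the paper's identifiability conditions are not reproduced here.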
