Researchers are increasingly using language models (LMs) for text annotation. These approaches rely only on a prompt instructing the model to return a given output according to a set of instructions. The reproducibility of LM outputs may nonetheless be vulnerable to small changes in prompt design, calling into question the replicability of such classification routines. To tackle this problem, researchers have typically tested a variety of semantically similar prompts to determine what we call "prompt stability." These approaches remain ad hoc and task-specific. In this article, we propose a general framework for diagnosing prompt stability by adapting traditional approaches to intra- and inter-coder reliability scoring. We call the resulting metric the Prompt Stability Score (PSS) and provide a Python package, PromptStability, for its estimation. Using six different datasets and twelve outcomes, we classify more than 150,000 rows of data to: a) diagnose when prompt stability is low; and b) demonstrate the functionality of the package. We conclude by providing best-practice recommendations for applied researchers.
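To make the core idea concrete, the following is a minimal illustrative sketch of how a prompt stability score in the inter-coder-reliability tradition could be computed: treat each semantically similar prompt variant as a "coder," annotate the same texts with each variant, and score agreement with Krippendorff's alpha. This is not the PromptStability package's actual API; the `prompt_stability_score` function and the `classify` callable are hypothetical stand-ins for an LM annotation call.

```python
# Illustrative sketch only: assumes each prompt variant acts as a "coder"
# and that agreement is scored with Krippendorff's alpha. Not the
# PromptStability package's API.
from typing import Callable, Sequence

import krippendorff  # pip install krippendorff
import numpy as np


def prompt_stability_score(
    classify: Callable[[str, str], int],  # hypothetical (prompt, text) -> categorical label
    prompts: Sequence[str],               # semantically similar prompt variants
    texts: Sequence[str],                 # units to annotate
) -> float:
    """Score agreement across prompt variants over the same units."""
    # Rows = prompt variants ("coders"), columns = texts (units).
    labels = np.array(
        [[classify(p, t) for t in texts] for p in prompts], dtype=float
    )
    # Nominal level of measurement: labels are unordered categories.
    return krippendorff.alpha(reliability_data=labels, level_of_measurement="nominal")


if __name__ == "__main__":
    # Toy usage with a deterministic stand-in classifier (no LM call).
    toy_prompts = [
        "Is this text positive? Answer 1 or 0.",
        "Label the sentiment: 1 = positive, 0 = not positive.",
    ]
    toy_texts = ["great product", "terrible service", "okay I guess"]
    toy_classify = lambda prompt, text: int("great" in text)
    print(prompt_stability_score(toy_classify, toy_prompts, toy_texts))
```

The same scaffolding would cover intra-prompt stability by repeating a single prompt across runs and treating each run as a coder; the design choice of alpha (rather than simple percent agreement) corrects for chance agreement, mirroring standard reliability practice.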