Recent work on text classification with Large Language Models (LLMs) in the social sciences suggests that costs can be cut significantly while performance sometimes rivals existing computational methods. However, performance varies widely across current tests, which raises the question of how to maximize it. In this paper, we focus on prompt context as a possible avenue for increasing accuracy, systematically varying three aspects of prompt engineering: label descriptions, instructional nudges, and few-shot examples. Across two different examples, our tests illustrate that a minimal increase in prompt context yields the largest gain in performance, while further increases tend to yield only marginal gains. Alarmingly, increasing prompt context sometimes decreases accuracy. Furthermore, our tests suggest substantial heterogeneity across models, tasks, and batch sizes, underlining the need to validate each LLM coding task individually rather than rely on general rules.
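As a rough illustration of what systematically varying the three prompt components could look like, the sketch below builds one prompt per on/off combination of label descriptions, an instructional nudge, and few-shot examples. It is a minimal sketch under stated assumptions, not the procedure used in the paper: the labels, descriptions, nudge wording, example texts, and the function names `build_prompt` and `prompt_conditions` are all hypothetical, and the full factorial crossing is an assumption about the design. No model is called; the code only assembles prompt strings.

```python
from itertools import product

# Hypothetical task: all labels, descriptions, and examples below are
# illustrative placeholders, not taken from the paper.
LABELS = {
    "positive": "The text expresses approval or support.",
    "negative": "The text expresses disapproval or opposition.",
}
NUDGE = "Think carefully and answer with exactly one label."
FEW_SHOT = [
    ("I fully support this proposal.", "positive"),
    ("This policy is a disaster.", "negative"),
]


def build_prompt(text, use_descriptions, use_nudge, use_few_shot):
    """Assemble a classification prompt from optional context components."""
    # Minimal context: the task statement and the bare label names.
    parts = ["Classify the text into one of: " + ", ".join(LABELS) + "."]
    if use_descriptions:
        parts += [f"- {label}: {desc}" for label, desc in LABELS.items()]
    if use_nudge:
        parts.append(NUDGE)
    if use_few_shot:
        parts += [f"Text: {ex}\nLabel: {label}" for ex, label in FEW_SHOT]
    # The item to classify always comes last.
    parts.append(f"Text: {text}\nLabel:")
    return "\n".join(parts)


def prompt_conditions(text):
    """Return one prompt per on/off combination of the three components."""
    return {
        flags: build_prompt(text, *flags)
        for flags in product([False, True], repeat=3)
    }


conditions = prompt_conditions("The new law helps nobody.")
```

Each key in `conditions` is a tuple of three booleans (descriptions, nudge, few-shot), so accuracy can be compared condition by condition once each prompt has been sent to a model and scored against hand-coded labels.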