Assessing classification confidence is critical for leveraging large language models (LLMs) in automated labeling tasks, especially in the sensitive domains addressed by Computational Social Science (CSS) tasks. In this paper, we make three key contributions: (1) we propose an uncertainty quantification (UQ) performance measure tailored for data annotation tasks, (2) we compare, for the first time, five different UQ strategies across three distinct LLMs and CSS data annotation tasks, and (3) we introduce a novel UQ aggregation strategy that effectively identifies low-confidence LLM annotations and disproportionately uncovers data incorrectly labeled by the LLMs. Our results demonstrate that our proposed UQ aggregation strategy improves upon existing methods and can be used to significantly improve human-in-the-loop data annotation processes.