Manually annotating data for computational social science (CSS) tasks can be costly, time-consuming, and emotionally draining. While recent work suggests that LLMs can perform such annotation tasks in zero-shot settings, little is known about how prompt design impacts LLMs' compliance and accuracy. We conduct a large-scale multi-prompt experiment to test how model selection (ChatGPT, PaLM2, and Falcon7b) and prompt design features (definition inclusion, output type, explanation, and prompt length) impact the compliance and accuracy of LLM-generated annotations on four CSS tasks (toxicity, sentiment, rumor stance, and news frames). Our results show that LLM compliance and accuracy are highly prompt-dependent. For instance, prompting for numerical scores instead of labels reduces all LLMs' compliance and accuracy. The overall best prompting setup is task-dependent, and minor prompt changes can cause large changes in the distribution of generated labels. By showing that prompt design significantly impacts the quality and distribution of LLM-generated annotations, this work serves as both a warning and a practical guide for researchers and practitioners.
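To make the experimental setup more concrete, the following is a minimal sketch of how a multi-prompt grid over the four design features might be enumerated and how compliance and accuracy might be scored for one prompt variant. The factor levels, the function names (`prompt_grid`, `parse_label`, `score`), and the exact definitions of compliance and accuracy used here are illustrative assumptions, not the authors' implementation; the abstract does not specify them.

```python
from itertools import product

# Hypothetical factor levels, named after the prompt design features
# listed in the abstract. The actual levels used in the study may differ.
FACTORS = {
    "definition": [False, True],        # include a task definition or not
    "output_type": ["label", "score"],  # ask for a label or a numerical score
    "explanation": [False, True],       # ask the model to explain or not
    "length": ["concise", "detailed"],  # short or long prompt wording
}


def prompt_grid(factors=FACTORS):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


def parse_label(response, allowed):
    """Return the normalized label if the response is exactly one allowed
    label (ignoring case, outer whitespace, and trailing periods), else None."""
    cleaned = response.strip().lower().rstrip(".")
    return cleaned if cleaned in allowed else None


def score(responses, gold, allowed):
    """Compute (compliance, accuracy) for one prompt variant.

    Assumed definitions, for illustration only:
      compliance = share of responses that parse to an allowed label
      accuracy   = share of compliant responses that match the gold label
    """
    parsed = [parse_label(r, allowed) for r in responses]
    compliant = [(p, g) for p, g in zip(parsed, gold) if p is not None]
    compliance = len(compliant) / len(responses) if responses else 0.0
    accuracy = (
        sum(p == g for p, g in compliant) / len(compliant) if compliant else 0.0
    )
    return compliance, accuracy


if __name__ == "__main__":
    print(len(prompt_grid()), "prompt variants")
    responses = ["Toxic", "not toxic.", "I cannot answer that", "toxic"]
    gold = ["toxic", "toxic", "toxic", "not toxic"]
    print(score(responses, gold, {"toxic", "not toxic"}))
```

Scoring each variant separately, rather than pooling responses, is what allows a comparison of how a single feature (for example, label versus score output) shifts compliance, accuracy, and the distribution of generated labels.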