Recent research highlights the significant potential of ChatGPT for text annotation in social science research. However, ChatGPT is a closed-source product, which has major drawbacks with regard to transparency, reproducibility, cost, and data protection. Recent advances in open-source (OS) large language models (LLMs) offer alternatives that remedy these challenges. It is therefore important to evaluate the performance of OS LLMs relative to ChatGPT and to standard supervised machine learning classifiers. We conduct a systematic comparative evaluation of a range of OS LLMs alongside ChatGPT, using both zero- and few-shot learning as well as generic and custom prompts, and compare the results with those of more traditional supervised classification models. Using a new dataset of tweets from US news media, and focusing on simple binary text annotation tasks for standard social science concepts, we find significant variation in the performance of ChatGPT and the OS models across tasks, and that supervised classifiers consistently outperform both. Given the unreliable performance of ChatGPT and the significant challenges it poses to Open Science, we advise against using ChatGPT for substantive text annotation tasks in social science research.