In the rapidly evolving landscape of Natural Language Processing (NLP), the use of Large Language Models (LLMs) for automated annotation of social media posts has garnered significant interest. Despite impressive advances in LLMs such as ChatGPT, their efficacy and accuracy as annotation tools are not well understood. In this paper, we analyze the performance of eight open-source and proprietary LLMs in annotating the stance expressed in social media posts, benchmarking them against the judgments of human (i.e., crowd-sourced) annotators. We also investigate the conditions under which LLMs are likely to disagree with human judgment. A key finding of our study is that how explicitly a text expresses a stance plays a critical role in how faithfully LLMs' stance judgments match humans'. We argue that LLMs perform well when human annotators do, and that LLM failures often correspond to situations in which human annotators struggle to reach agreement. We conclude with recommendations for a comprehensive approach that combines the precision of human expertise with the scalability of LLM predictions. This study highlights the importance of improving the accuracy and comprehensiveness of automated stance detection, with the aim of advancing these technologies toward more efficient and unbiased analysis of social media.