Instruction-tuned Large Language Models (LLMs) have exhibited impressive language understanding and the capacity to generate responses that follow specific prompts. However, because training these models is computationally demanding, they are often applied in a zero-shot setting. In this paper, we evaluate the zero-shot performance of two publicly accessible LLMs, ChatGPT and OpenAssistant, on six Computational Social Science classification tasks, and we investigate the effects of various prompting strategies. Our experiments examine the impact of prompt complexity, including the effect of incorporating label definitions into the prompt, the use of synonyms for label names, and the influence of integrating past memories during foundation model training. The findings indicate that, in a zero-shot setting, current LLMs are unable to match the performance of smaller, fine-tuned baseline transformer models (such as BERT-large). Additionally, we find that different prompting strategies can significantly affect classification accuracy, with accuracy and F1 scores varying by more than 10%.