Instruction-tuned Large Language Models (LLMs) have exhibited impressive language understanding and the capacity to generate responses that follow specific instructions. However, because training these models is computationally demanding, their applications often rely on zero-shot settings. In this paper, we evaluate the zero-shot performance of two publicly accessible LLMs, ChatGPT and OpenAssistant, on Computational Social Science classification tasks, and we investigate the effects of various prompting strategies. Our experiments consider the impact of prompt complexity, including the effect of incorporating label definitions into the prompt, of using synonyms for label names, and of integrating past memories during foundation model training. The findings indicate that, in a zero-shot setting, current LLMs are unable to match the performance of smaller, fine-tuned baseline transformer models (such as BERT). Additionally, we find that different prompting strategies can significantly affect classification performance, with variations in accuracy and F1 scores exceeding 10%.
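To make the prompt-complexity variations concrete, the following is a minimal sketch of how a zero-shot classification prompt might be assembled with optional label definitions and optional label synonyms. The function name `build_prompt`, the prompt wording, and the example labels are all hypothetical and chosen for illustration only; they are not the templates used in the experiments.

```python
from typing import Mapping, Optional, Sequence


def build_prompt(
    text: str,
    labels: Sequence[str],
    definitions: Optional[Mapping[str, str]] = None,
    synonyms: Optional[Mapping[str, str]] = None,
) -> str:
    """Assemble a zero-shot classification prompt (hypothetical template)."""
    # Optionally replace each label name with a synonym; labels without
    # a synonym keep their original name.
    shown = [synonyms.get(lab, lab) if synonyms else lab for lab in labels]

    lines = [f"Classify the following text as one of: {', '.join(shown)}."]

    # Optionally add one definition line per label that has a definition.
    if definitions:
        for lab, name in zip(labels, shown):
            if lab in definitions:
                lines.append(f"{name}: {definitions[lab]}")

    lines.append(f"Text: {text}")
    lines.append("Label:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative labels and definitions only (assumptions, not from the paper).
    print(
        build_prompt(
            "I love this.",
            ["positive", "negative"],
            definitions={"positive": "expresses approval"},
        )
    )
```

Calling the function with neither `definitions` nor `synonyms` yields the simplest prompt; supplying either argument produces one of the more complex variants, so the same input text can be evaluated under each strategy and the resulting accuracy and F1 scores compared.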