Instruction-tuned Large Language Models (LLMs) have exhibited impressive language understanding and the capacity to generate responses that follow specific prompts. However, due to the computational demands of training these models, their applications often adopt a zero-shot setting. In this paper, we evaluate the zero-shot performance of two publicly accessible LLMs, ChatGPT and OpenAssistant, on six Computational Social Science classification tasks, while also investigating the effects of various prompting strategies. Our experiments investigate the impact of prompt complexity, including the effect of incorporating label definitions into the prompt, of using synonyms for label names, and of integrating past memories during foundation model training. The findings indicate that in a zero-shot setting, current LLMs are unable to match the performance of smaller, fine-tuned baseline transformer models (such as BERT-large). Additionally, we find that different prompting strategies can significantly affect classification performance, with accuracy and F1 scores varying by more than 10%.
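To make the prompt-complexity variations concrete, the following is a minimal Python sketch of how a zero-shot classification prompt might be assembled with optional label definitions and optional label synonyms. The function name `build_prompt`, the template wording, and the example labels are illustrative assumptions, not the prompts used in the paper.

```python
from typing import Mapping, Optional, Sequence


def build_prompt(
    text: str,
    labels: Sequence[str],
    definitions: Optional[Mapping[str, str]] = None,
    synonyms: Optional[Mapping[str, str]] = None,
) -> str:
    """Assemble a zero-shot classification prompt (hypothetical template)."""
    synonyms = synonyms or {}
    # Optionally replace each label name with a synonym.
    shown = [synonyms.get(label, label) for label in labels]
    lines = ["Classify the text into one of: " + ", ".join(shown) + "."]
    if definitions:
        # Optionally append a definition for each label that has one.
        for label, name in zip(labels, shown):
            if label in definitions:
                lines.append(f"{name}: {definitions[label]}")
    lines.append(f"Text: {text}")
    lines.append("Label:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Basic prompt: label names only.
    print(build_prompt("You are great", ["offensive", "not offensive"]))
    # More complex prompt: synonym for one label plus a definition.
    print(
        build_prompt(
            "You are great",
            ["offensive", "not offensive"],
            definitions={"offensive": "contains insults or abusive language"},
            synonyms={"offensive": "abusive"},
        )
    )
```

Keeping the definitions and synonyms as optional arguments lets the same input text be rendered under each prompt variant, so that differences in accuracy or F1 can be attributed to the prompt rather than to the data.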