Spurred by advancements in scale, large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot, i.e., without adaptation on downstream data. Recently, the debut of ChatGPT has drawn a great deal of attention from the NLP community because it can generate high-quality responses to human input and self-correct previous mistakes based on subsequent conversations. However, it is not yet known whether ChatGPT can serve as a generalist model that performs many NLP tasks zero-shot. In this work, we empirically analyze the zero-shot learning ability of ChatGPT by evaluating it on 20 popular NLP datasets covering 7 representative task categories. Through extensive empirical studies, we demonstrate both the effectiveness and the limitations of the current version of ChatGPT. We find that ChatGPT performs well on many tasks favoring reasoning capabilities (e.g., arithmetic reasoning), while it still faces challenges on specific tasks such as sequence tagging. We additionally provide in-depth analysis through qualitative case studies.