This paper proposes a framework for quantitatively evaluating interactive LLMs such as ChatGPT using publicly available datasets. We carry out an extensive technical evaluation of ChatGPT using 23 datasets covering 8 common NLP application tasks. We evaluate the multitask, multilingual, and multimodal aspects of ChatGPT based on these datasets and a newly designed multimodal dataset. We find that ChatGPT outperforms LLMs with zero-shot learning on most tasks and even outperforms fine-tuned models on some tasks. It is better at understanding non-Latin script languages than at generating them. It is able to generate multimodal content from textual prompts via an intermediate code generation step. Moreover, ChatGPT is 63.41% accurate on average across 10 reasoning categories spanning logical reasoning, non-textual reasoning, and commonsense reasoning, which makes it an unreliable reasoner. It is, for example, better at deductive than inductive reasoning. Like other LLMs, ChatGPT suffers from hallucination problems, and it generates more extrinsic hallucinations from its parametric memory because it does not have access to an external knowledge base. Finally, the interactive feature of ChatGPT enables human collaboration with the underlying LLM to improve its performance, i.e., by 8% ROUGE-1 on summarization and 2% ChrF++ on machine translation, in a multi-turn "prompt engineering" fashion. We also release a codebase for evaluation set extraction.