Large Language Models (LLMs) have exhibited remarkable performance on various Natural Language Processing (NLP) tasks. However, their reasoning capacity remains the subject of active debate. In this paper, we examine the performance of GPT-3.5, GPT-4, and BARD through a thorough technical evaluation on different reasoning tasks across eleven distinct datasets. We provide empirical evidence that GPT-4 outperforms both GPT-3.5 and BARD in the zero-shot setting on almost all evaluated tasks. While the superiority of GPT-4 over GPT-3.5 might be explained by its larger size and NLP efficiency, this explanation was not evident in the case of BARD. We also demonstrate that all three models show limited proficiency in inductive, mathematical, and multi-hop reasoning tasks. To support these findings, we present a detailed analysis of the results from the three models. Furthermore, we propose a set of engineered prompts that improves the zero-shot performance of all three models.