Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4

Harnessing logical reasoning ability is a comprehensive natural language understanding endeavor. With the release of Generative Pretrained Transformer 4 (GPT-4), highlighted as "advanced" at reasoning tasks, we are eager to learn how GPT-4 performs on various logical reasoning tasks. This report analyses multiple logical reasoning datasets, including popular benchmarks such as LogiQA and ReClor and newly released datasets such as AR-LSAT. We test multi-choice reading comprehension and natural language inference tasks using benchmarks that require logical reasoning. We further construct an out-of-distribution logical reasoning dataset to investigate the robustness of ChatGPT and GPT-4, and we compare the performance of the two models. Experimental results show that ChatGPT performs significantly better than the RoBERTa fine-tuning method on most logical reasoning benchmarks. With early access to the GPT-4 API, we are able to conduct intensive experiments on the GPT-4 model. The results show that GPT-4 yields even higher performance on most logical reasoning datasets. Among the benchmarks, ChatGPT and GPT-4 do relatively well on well-known datasets such as LogiQA and ReClor. However, their performance drops significantly on newly released and out-of-distribution datasets. Logical reasoning remains challenging for ChatGPT and GPT-4, especially on out-of-distribution and natural language inference datasets. We release the prompt-style logical reasoning datasets as a benchmark suite named LogiEval.
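To make the evaluation setup more concrete, the following is a minimal sketch of how a multi-choice reading comprehension item could be rendered as a prompt and how model replies could be scored for accuracy. The prompt layout, the `MultiChoiceItem` type, and the helper names are illustrative assumptions, not the actual LogiEval format or the authors' code; the model call itself is left out, and replies are passed in as plain strings.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

LABELS = "ABCD"  # assumption: at most four options per item


@dataclass
class MultiChoiceItem:
    """Hypothetical container for one multi-choice question."""
    passage: str
    question: str
    options: List[str]
    answer: int  # index of the gold option


def build_prompt(item: MultiChoiceItem) -> str:
    """Render one item as a plain-text prompt (illustrative layout only)."""
    lines = [f"Passage: {item.passage}", f"Question: {item.question}"]
    lines += [f"{LABELS[i]}. {opt}" for i, opt in enumerate(item.options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def parse_choice(reply: str, n_options: int) -> Optional[int]:
    """Return the index of the first standalone option letter in the reply.

    Naive heuristic: a reply that opens with the article "A" would be
    misread as option A, so real evaluations need stricter parsing.
    """
    match = re.search(rf"\b([{LABELS[:n_options]}])\b", reply)
    return LABELS.index(match.group(1)) if match else None


def accuracy(items: List[MultiChoiceItem], replies: List[str]) -> float:
    """Fraction of replies whose parsed choice equals the gold answer."""
    if len(items) != len(replies):
        raise ValueError("items and replies must have the same length")
    if not items:
        return 0.0
    correct = sum(
        parse_choice(reply, len(item.options)) == item.answer
        for item, reply in zip(items, replies)
    )
    return correct / len(items)
```

Unparseable replies are counted as wrong here, which is one possible convention; a stricter harness might report them separately.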


