Harnessing logical reasoning ability is a comprehensive natural language understanding endeavor. With the release of Generative Pretrained Transformer 4 (GPT-4), which is highlighted as "advanced" at reasoning tasks, we are eager to learn how GPT-4 performs on various logical reasoning tasks. This report analyses multiple logical reasoning datasets, including popular benchmarks such as LogiQA and ReClor and newly released datasets such as AR-LSAT. We test multi-choice reading comprehension and natural language inference tasks on benchmarks that require logical reasoning. We further construct a logical reasoning out-of-distribution dataset to investigate the robustness of ChatGPT and GPT-4, and we compare the performance of the two models. Experimental results show that ChatGPT performs significantly better than the RoBERTa fine-tuning method on most logical reasoning benchmarks. GPT-4 shows even higher performance on our manual tests. Among the benchmarks, ChatGPT and GPT-4 do relatively well on well-known datasets such as LogiQA and ReClor. However, their performance drops significantly on newly released and out-of-distribution datasets. Logical reasoning remains challenging for ChatGPT and GPT-4, especially on out-of-distribution and natural language inference datasets.