A Systematic Evaluation of Large Language Models on Out-of-Distribution Logical Reasoning Tasks

Large language models (LLMs), such as GPT-3.5 and GPT-4, have advanced the performance of artificial systems on various natural language processing tasks to human-like levels. However, their generalisation and robustness in logical reasoning remain under-evaluated. To probe this ability, we propose three new logical reasoning datasets named "ReClor-plus", "LogiQA-plus" and "LogiQAv2-plus", each featuring three subsets: the first with randomly shuffled options, the second with the correct choice replaced by "none of the other options are correct", and the third combining both perturbations. We carry out experiments on these datasets with both discriminative and generative LLMs and show that these simple modifications greatly hinder the performance of the language models. Despite their superior performance on the original publicly available datasets, all models struggle on our newly constructed datasets. We show that introducing task variations by perturbing a sizable training set can markedly improve a model's generalisation and robustness in logical reasoning tasks. Moreover, applying logic-driven data augmentation for fine-tuning, combined with prompting, can enhance the generalisation performance of both discriminative and generative LLMs. These results offer insights into assessing and improving the generalisation and robustness of LLMs for logical reasoning tasks. We make our source code and data publicly available at \url{https://github.com/Strong-AI-Lab/Logical-and-abstract-reasoning}.
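
The three subset constructions described above can be illustrated with a minimal Python sketch. It is not the released implementation: the example schema (`context`, `question`, `options`, `label`), the function names, and the order of operations in the combined variant (shuffle first, then replace) are all assumptions made for illustration.

```python
import random
from typing import Any, Dict

# Replacement text quoted in the abstract.
NONE_OPTION = "none of the other options are correct"

# Assumed (hypothetical) schema for one multiple-choice example:
# {"context": str, "question": str, "options": list of str, "label": int}
Example = Dict[str, Any]


def shuffle_options(example: Example, rng: random.Random) -> Example:
    """Subset 1: randomly permute the options and remap the gold label."""
    order = list(range(len(example["options"])))
    rng.shuffle(order)
    return {
        **example,
        "options": [example["options"][i] for i in order],
        # New position of the option that was originally correct.
        "label": order.index(example["label"]),
    }


def replace_answer(example: Example) -> Example:
    """Subset 2: overwrite the correct option with the 'none' option.

    The label is unchanged, so the 'none' option becomes the gold answer.
    """
    options = list(example["options"])
    options[example["label"]] = NONE_OPTION
    return {**example, "options": options}


def shuffle_and_replace(example: Example, rng: random.Random) -> Example:
    """Subset 3: apply both perturbations (order of application is an assumption)."""
    return replace_answer(shuffle_options(example, rng))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the perturbed set is reproducible
    sample = {
        "context": "All birds can fly. Penguins are birds.",
        "question": "Which conclusion follows from the premises?",
        "options": ["Penguins can fly.", "Penguins cannot fly.",
                    "Some birds swim.", "No birds fly."],
        "label": 0,
    }
    for variant in (shuffle_options(sample, rng),
                    replace_answer(sample),
                    shuffle_and_replace(sample, rng)):
        print(variant["label"], variant["options"])
```

Each function returns a new dictionary and leaves its input untouched, so one original example can yield all three variants.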
