The use of Pretrained Language Models (PLMs) to acquire factual knowledge has attracted increasing attention, and PLMs show promising performance on many knowledge-intensive tasks. This performance has led the community to believe that the models possess a modicum of reasoning competence rather than merely memorising knowledge. In this paper, we conduct a comprehensive evaluation of the learnable deductive (also known as explicit) reasoning capability of PLMs. Through a series of controlled experiments, we report two main findings. (i) PLMs generalise learned logic rules inadequately and perform inconsistently under simple adversarial surface-form edits. (ii) Although fine-tuning PLMs for deductive reasoning does improve their performance on reasoning over unseen knowledge facts, it causes catastrophic forgetting of previously learnt knowledge. Our main results suggest that PLMs cannot yet perform reliable deductive reasoning, which demonstrates the importance of controlled examination and probing of their reasoning abilities. By looking beyond (misleading) task performance, we show that PLMs are still far from human-level reasoning capabilities, even on simple deductive tasks.