Large language models (LLMs) have attracted enormous attention from both academia and industry owing to their exceptional language generation ability and strong generalization. However, current LLMs still produce unreliable content in practical reasoning tasks because of inherent issues such as hallucination. To better understand this problem, we conduct an in-depth investigation that systematically explores the capability of LLMs in logical reasoning. Specifically, we first investigate the deficiencies of LLMs in logical reasoning on different tasks, including event relation extraction and deductive reasoning. Our study demonstrates that LLMs are not good reasoners on tasks that require rigorous reasoning and that they produce counterfactual answers, which must then be iteratively refined. We therefore comprehensively explore different strategies for endowing LLMs with logical reasoning ability, enabling them to generate more logically consistent answers across different scenarios. Based on our approach, we also contribute a synthesized dataset (LLM-LR) involving multi-hop reasoning for evaluation and pre-training. Extensive quantitative and qualitative analyses on different tasks validate the effectiveness and necessity of teaching LLMs logic and provide insights for solving practical tasks with LLMs in future work.