Large language models (LLMs) have revolutionized many areas, such as natural language processing and software engineering, by achieving state-of-the-art performance on a wide range of downstream tasks. In pursuit of robust and general artificial intelligence, there has been a surge of interest in investigating the reasoning ability of LLMs. However, the textual and numerical reasoning benchmarks adopted by previous works are rather shallow and simple, so positive results on these benchmarks alone do not justify the conclusion that LLMs possess strong reasoning ability. Recent efforts that evaluate LLMs on reinforcement learning benchmarks have demonstrated that they are poor at solving sequential decision-making problems that require common-sense planning. In this work, we conduct an in-depth assessment of the reasoning ability of several state-of-the-art LLMs on the inductive logic programming (ILP) benchmark, which is broadly recognized as a representative and challenging measure for evaluating logic program induction/synthesis systems, because it requires inducing strict cause-effect logic to achieve robust deduction on independent and identically distributed (IID) and out-of-distribution (OOD) test samples. Our evaluations show that, compared with neural program induction systems that are much smaller in model size, state-of-the-art LLMs are much weaker in reasoning ability, achieving much lower performance and generalization under both natural language prompting and truth-value matrix prompting.
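To make the ILP setting concrete, the following is a minimal sketch of what an induction task with IID and OOD evaluation might look like. It assumes a toy family-relation task (`parent` as background knowledge, `grandparent` as the target) on a chain of constants, with the OOD split being simply a larger set of constants. The task, the function names, and the matrix encoding are all hypothetical illustrations, not the benchmark tasks or prompt format used in this work.

```python
from itertools import product


def truth_matrix(pairs, n):
    """n x n 0/1 matrix M with M[i][j] = 1 iff the binary predicate holds for (i, j)."""
    return [[1 if (i, j) in pairs else 0 for j in range(n)] for i in range(n)]


def chain_facts(n):
    """Hypothetical background knowledge: parent(i, i+1) on a chain of n constants."""
    return {(i, i + 1) for i in range(n - 1)}


def grandparent_truth(n):
    """Ground-truth target predicate on the same chain: grandparent(i, i+2)."""
    return {(i, i + 2) for i in range(n - 2)}


def grandparent_rule(parent, n):
    """Candidate induced rule: grandparent(x, z) :- parent(x, y), parent(y, z)."""
    return {
        (x, z)
        for x, y, z in product(range(n), repeat=3)
        if (x, y) in parent and (y, z) in parent
    }


def accuracy(predicted, target, n):
    """Fraction of the n * n ground atoms whose truth value is predicted correctly."""
    cells = list(product(range(n), repeat=2))
    hits = sum((c in predicted) == (c in target) for c in cells)
    return hits / len(cells)


# Assumed split sizes: 5 constants for IID, 12 constants for OOD.
N_IID, N_OOD = 5, 12

# A truth-value matrix is one way to present the facts as a grid of 0/1 values.
for row in truth_matrix(chain_facts(N_IID), N_IID):
    print(row)

# A correctly induced rule transfers to the larger domain.
rule_iid = accuracy(grandparent_rule(chain_facts(N_IID), N_IID), grandparent_truth(N_IID), N_IID)
rule_ood = accuracy(grandparent_rule(chain_facts(N_OOD), N_OOD), grandparent_truth(N_OOD), N_OOD)

# A baseline that only memorizes the IID target atoms does not.
memo_ood = accuracy(grandparent_truth(N_IID), grandparent_truth(N_OOD), N_OOD)

print(rule_iid, rule_ood, memo_ood)
```

In this sketch the induced rule scores perfectly on both splits because it encodes the actual relation, whereas the memorizing baseline loses accuracy once new constants appear. This is the kind of gap between fitting the examples and inducing the underlying logic that IID and OOD test samples are meant to expose.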