Despite the remarkable progress of neural models, their ability to generalize, a cornerstone of applications such as logical reasoning, remains a critical challenge. We delineate two fundamental aspects of this ability: compositionality, the capacity to abstract the atomic logical rules underlying complex inferences, and recursiveness, the aptitude to build intricate representations through the iterative application of inference rules. In the literature, these two aspects are often conflated under the umbrella term of generalization. To sharpen this distinction, we investigate the logical generalization capabilities of large language models (LLMs) using the syllogistic fragment as a benchmark for natural language reasoning. We extend classical syllogistic forms to construct more complex structures, yielding a foundational yet expressive subset of formal logic that supports controlled evaluation of essential reasoning abilities. Our findings on this non-trivial benchmark show that, while LLMs demonstrate reasonable proficiency in recursiveness, they struggle with compositionality. Nor is performance uniform: a finer-grained analysis reveals substantial variability in generalization across individual syllogistic types, ranging from near-perfect accuracy to markedly lower performance. To overcome these limitations and establish a reliable logical prover, we propose a hybrid architecture that integrates symbolic reasoning with neural computation. This synergistic interaction enables robust and efficient inference: neural components accelerate processing, while symbolic reasoning guarantees completeness. Our experiments further show that high efficiency is preserved even with relatively small neural components. Overall, our analysis provides both a rationale for hybrid neuro-symbolic approaches and evidence of their potential to address key generalization barriers in neural reasoning systems.
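The division of labor described in the abstract (neural components that speed up search, symbolic reasoning that guarantees completeness) can be illustrated with a minimal sketch. This is not the authors' architecture. It assumes premises restricted to universal affirmatives ("All X are Y"), chained by the classical Barbara rule, and it uses a hypothetical `heuristic` function as a stand-in for a neural scorer. Longer chains correspond to more iterations of the same rule, which is the recursive dimension discussed above. Because every reachable term is eventually expanded, the heuristic can only change the order of the search (and hence its cost), never whether a proof is found.

```python
import heapq
from typing import Callable, Dict, List, Optional, Set, Tuple

# A premise ("All X are Y") is an edge X -> Y. This covers only universal
# affirmative sentences, a small slice of the full syllogistic fragment.
Premise = Tuple[str, str]
# Hypothetical stand-in for a neural scorer: lower score = expand sooner.
Heuristic = Callable[[str, str], float]


def build_graph(premises: List[Premise]) -> Dict[str, Set[str]]:
    graph: Dict[str, Set[str]] = {}
    for sub, sup in premises:
        graph.setdefault(sub, set()).add(sup)
    return graph


def prove_all(
    premises: List[Premise],
    subject: str,
    predicate: str,
    heuristic: Heuristic = lambda term, goal: 0.0,
) -> Tuple[Optional[List[str]], int]:
    """Try to derive "All subject are predicate" by repeated Barbara steps.

    Returns (chain, expansions): chain lists the terms from subject to
    predicate (each consecutive pair is a premise), or None if no derivation
    exists; expansions counts how many terms the search popped.
    """
    graph = build_graph(premises)
    # Entries are (score, insertion order, term); the counter breaks ties.
    frontier = [(heuristic(subject, predicate), 0, subject)]
    parent: Dict[str, Optional[str]] = {subject: None}
    counter = 1
    expansions = 0
    while frontier:
        _, _, term = heapq.heappop(frontier)
        expansions += 1
        if term == predicate:
            chain: List[str] = []
            node: Optional[str] = term
            while node is not None:
                chain.append(node)
                node = parent[node]
            return chain[::-1], expansions
        for nxt in sorted(graph.get(term, ())):
            # Each reachable term is queued exactly once, so the search is
            # exhaustive (complete) whatever scores the heuristic assigns.
            if nxt not in parent:
                parent[nxt] = term
                heapq.heappush(frontier, (heuristic(nxt, predicate), counter, nxt))
                counter += 1
    return None, expansions


PREMISES: List[Premise] = [
    ("dog", "mammal"),
    ("mammal", "animal"),
    ("animal", "living"),
    ("dog", "pet"),
    ("pet", "owned"),
]

if __name__ == "__main__":
    # Uninformed (purely symbolic) search.
    print(prove_all(PREMISES, "dog", "living"))
    # → (['dog', 'mammal', 'animal', 'living'], 6)

    # Hypothetical "guided" scorer that happens to favor the right terms.
    hints = {"mammal", "animal", "living"}
    guided = lambda term, goal: 0.0 if term in hints else 1.0
    print(prove_all(PREMISES, "dog", "living", guided))
    # → (['dog', 'mammal', 'animal', 'living'], 4)

    # Not derivable: the search terminates and reports failure.
    print(prove_all(PREMISES, "pet", "animal"))
    # → (None, 2)
```

In this sketch a good scorer reduces the number of expansions (6 to 4 here), while a misleading one merely wastes work: the symbolic search still finds the same chain. That mirrors, in a deliberately simplified setting, the claim that neural guidance affects efficiency and symbolic reasoning secures completeness.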