Evaluations of large language models (LLMs) primarily emphasize convergent logical reasoning, where success is defined by producing a single correct proof. However, many real-world reasoning problems admit multiple valid derivations, requiring models to explore diverse logical paths rather than committing to one route. To address this limitation, we introduce LogicGraph, the first benchmark designed to systematically evaluate multi-path logical reasoning, constructed via a neuro-symbolic framework that combines backward logic generation with semantic instantiation. This pipeline yields solver-verified reasoning problems characterized by high-depth multi-path reasoning and inherent logical distractors, where each instance is paired with an exhaustive set of minimal proofs. We further propose a reference-free evaluation framework to rigorously assess model performance in both convergent and divergent regimes. Experiments on state-of-the-art language models reveal a common limitation: models tend to commit early to a single route and fail to explore alternatives, and this coverage gap widens substantially with reasoning depth. LogicGraph exposes this divergence gap and provides actionable insights to motivate future improvements. Our code and data will be released at https://github.com/kkkkarry/LogicGraph.
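To make the construction concrete, the following is a minimal sketch, not the authors' implementation, of how backward logic generation can produce problems that admit several minimal proofs: starting from a goal fact, each fact awaiting support is given multiple alternative Horn-rule bodies, so independent derivation routes arise by design. All names (`generate_backward`, the `p<k>` atoms, the `branching` parameter) are hypothetical placeholders.

```python
# Hypothetical sketch of backward logic generation: expand a goal fact
# backward into alternative Horn rules until a depth budget is exhausted.
import random

def generate_backward(goal, depth, branching=2, seed=0):
    """Return (rules, facts): Horn rules deriving `goal` plus base premises.

    Each expanded fact receives `branching` alternative rule bodies,
    so the resulting problem admits multiple minimal proofs.
    """
    rng = random.Random(seed)
    rules, facts = [], set()
    frontier = [(goal, depth)]
    counter = 0
    while frontier:
        fact, d = frontier.pop()
        if d == 0:
            facts.add(fact)          # ground out as a given premise
            continue
        for _ in range(branching):   # alternative derivations of `fact`
            body = []
            for _ in range(rng.randint(1, 2)):
                counter += 1
                sub = f"p{counter}"
                body.append(sub)
                frontier.append((sub, d - 1))
            rules.append((body, fact))
    return rules, facts

rules, facts = generate_backward("goal", depth=2)
for body, head in rules:
    print(" & ".join(body), "->", head)
print("premises:", sorted(facts))
```

In a full pipeline the symbolic skeleton would then be semantically instantiated (abstract atoms replaced with natural-language statements) and checked by a solver; this sketch only illustrates why multiple routes to the goal exist.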
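The divergent regime described above can be scored by proof coverage: the fraction of an instance's exhaustive minimal-proof set that a model actually recovers. The sketch below is one plausible formulation under the assumption that each proof is represented as a set of rule identifiers; the function name and data layout are assumptions, not the paper's API.

```python
# Hedged sketch of a proof-coverage score for the divergent regime.

def proof_coverage(reference_proofs, model_proofs):
    """Fraction of the exhaustive minimal-proof set the model found.

    Each proof is a frozenset of rule identifiers, so syntactic
    reorderings of the same derivation collapse to a single proof.
    """
    reference = {frozenset(p) for p in reference_proofs}
    found = {frozenset(p) for p in model_proofs} & reference
    return len(found) / len(reference)

# Toy usage: two minimal proofs exist; the model commits to one route.
refs = [["r1", "r2"], ["r3"]]
outs = [["r2", "r1"]]
print(proof_coverage(refs, outs))  # 0.5 -> half the proof space explored
```

A widening gap between convergent accuracy and this coverage score as reasoning depth grows is exactly the early-commitment failure mode the abstract reports.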