The ability of Large Language Models (LLMs) to perform reasoning tasks such as deduction has been widely investigated in recent years. Yet their capacity to generate proofs, i.e., faithful, human-readable explanations of why conclusions follow, remains largely underexplored. In this work, we study proof generation in the context of OWL ontologies, which are widely adopted for representing and reasoning over complex knowledge, by developing an automated dataset construction and evaluation framework. Our evaluation encompasses three sequential tasks for complete proving: Extraction, Simplification, and Explanation, as well as an additional task of assessing the Logic Completeness of the premises. Through extensive experiments on widely used reasoning LLMs, we obtain several important findings: (1) some models achieve strong overall results but remain limited on complex cases; (2) logical complexity, rather than representation format (formal logic language versus natural language), is the dominant factor shaping LLM performance; and (3) noise and incompleteness in input data substantially diminish LLM performance. Together, these results underscore both the promise of LLMs for rigorous logical explanation and the remaining gap in supporting resilient reasoning under complex or imperfect conditions. Code and data are available at https://github.com/HuiYang1997/LLMOwlR.