While large language models (LLMs) have demonstrated impressive capabilities across natural language processing tasks by acquiring rich factual knowledge from their broad training data, their ability to synthesize and logically reason over this knowledge in complex ways remains underexplored. In this work, we present a systematic evaluation of the complex logical reasoning abilities of state-of-the-art LLMs through a novel benchmark of automatically generated reasoning questions over general-domain and biomedical knowledge graphs. Our extensive experiments, employing diverse in-context learning techniques, reveal that LLMs excel at reasoning over general world knowledge but face significant challenges with specialized, domain-specific knowledge. We find that prompting with explicit Chain-of-Thought demonstrations substantially improves LLM performance on complex logical reasoning tasks involving diverse logical operations. Interestingly, our controlled evaluations uncover an asymmetry: LLMs display proficiency at set union operations but struggle considerably with set intersections, a key building block of logical reasoning. To foster further work, we will publicly release our evaluation benchmark and code.
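The union/intersection asymmetry concerns basic set operations over knowledge-graph query answers. A minimal, hypothetical sketch (this toy graph and its entities are illustrative, not the paper's benchmark) shows what the two operations ask of a model: a union query accepts entities satisfying either condition, an intersection query only those satisfying both.

```python
# Toy knowledge graph as (head, relation, tail) triples.
# All entities and relations here are invented for illustration.
triples = {
    ("aspirin", "treats", "headache"),
    ("ibuprofen", "treats", "headache"),
    ("ibuprofen", "treats", "fever"),
    ("paracetamol", "treats", "fever"),
}

def answers(relation, tail):
    """Return the set of head entities h with (h, relation, tail) in the graph."""
    return {h for (h, r, t) in triples if r == relation and t == tail}

# Union: entities that treat headache OR fever.
union = answers("treats", "headache") | answers("treats", "fever")
# Intersection: entities that treat headache AND fever --
# the operation the evaluation finds LLMs struggle with.
intersection = answers("treats", "headache") & answers("treats", "fever")

print(sorted(union))         # all three drugs
print(sorted(intersection))  # only the drug satisfying both conditions
```

For a symbolic engine both operations are equally trivial; the evaluation's point is that LLMs answering such questions in natural language show no such symmetry.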