Large Language Models (LLMs) have been found to struggle with systematic reasoning. Even on tasks where they appear to perform well, their performance often depends on shortcuts rather than on genuine reasoning abilities, leading them to collapse on out-of-distribution examples. Post-training strategies based on reinforcement learning and chain-of-thought prompting have recently been hailed as a step change. However, little is yet known about the potential of the resulting ``Large Reasoning Models'' (LRMs) beyond problem solving in mathematics and programming, where finding genuine out-of-distribution problems can be difficult. In this paper, we focus on tasks that require systematic reasoning about relational compositions, especially for qualitative spatial and temporal reasoning. These tasks allow us to control the difficulty of problem instances and to measure, in a precise way, the extent to which models can generalise. We find that the considered LLMs and LRMs overall perform poorly, albeit better than random chance.