Inductive reasoning is an essential capability for large language models (LLMs) to achieve higher intelligence: it requires a model to generalize rules from observed facts and then apply them to unseen examples. We present MIRAGE, a synthetic dataset that addresses the limitations of previous work, namely the lack of comprehensive evaluation and flexible test data. With it, we evaluate LLMs' capabilities in both the inductive and deductive stages, allowing flexible variation of the input distribution, task scenario, and task difficulty to analyze the factors that influence LLMs' inductive reasoning. Based on these multi-faceted evaluations, we demonstrate that LLMs are poor rule-based reasoners: in many cases, when conducting inductive reasoning, they do not rely on a correct rule to answer unseen cases. Across different prompting methods, numbers of observations, and task forms, models consistently produce correct deductions without having induced correct rules. Moreover, we find that LLMs are good neighbor-based reasoners: during inductive reasoning, a model tends to focus on observed facts that are close to the current test example in feature space. By leveraging these similar examples, the model maintains strong inductive capabilities within a localized region, significantly improving its deductive performance.