Phenomenal Yet Puzzling: Testing Inductive Reasoning Capabilities of Language Models with Hypothesis Refinement

The ability to derive underlying principles from a handful of observations and then generalize to novel situations, known as inductive reasoning, is central to human intelligence. Prior work suggests that language models (LMs) often fall short on inductive reasoning, despite achieving impressive success on research benchmarks. In this work, we conduct a systematic study of the inductive reasoning capabilities of LMs through iterative hypothesis refinement, a technique that more closely mirrors the human inductive process than standard input-output prompting. Iterative hypothesis refinement employs a three-step process: proposing, selecting, and refining hypotheses in the form of textual rules. By examining the intermediate rules, we observe that LMs are phenomenal hypothesis proposers (i.e., generators of candidate rules). When coupled with a (task-specific) symbolic interpreter that can systematically filter the proposed set of rules, this hybrid approach achieves strong results across inductive reasoning benchmarks that require inducing causal relations, language-like instructions, and symbolic concepts. However, LMs also behave as puzzling inductive reasoners: they show notable performance gaps between rule induction (i.e., identifying plausible rules) and rule application (i.e., applying proposed rules to instances), which suggests that LMs propose hypotheses without being able to actually apply them. Through empirical and human analyses, we further reveal several discrepancies between the inductive reasoning processes of LMs and humans, shedding light on both the potential and the limitations of using LMs in inductive reasoning tasks.
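The propose, select, and refine loop described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the names `refine`, `score`, and `toy_propose` are hypothetical, rules are plain Python functions rather than textual rules, and `toy_propose` is a hard-coded stand-in for an LM call. Scoring a rule by executing it on the examples plays the role of the symbolic interpreter.

```python
from typing import Callable, List, Optional, Tuple

Example = Tuple[int, int]      # one (input, output) observation
Rule = Callable[[int], int]    # an executable hypothesis (assumed form)
Proposer = Callable[[List[Example], List[Example]], List[Rule]]


def score(rule: Rule, examples: List[Example]) -> float:
    """Interpreter step: fraction of examples the rule reproduces."""
    return sum(rule(x) == y for x, y in examples) / len(examples)


def refine(propose: Proposer, examples: List[Example],
           max_iters: int = 3) -> Tuple[Optional[Rule], float]:
    """Iteratively propose rules, select the best, and feed back failures."""
    best: Optional[Rule] = None
    best_score = -1.0
    feedback: List[Example] = []  # examples the current best rule gets wrong
    for _ in range(max_iters):
        candidates = propose(examples, feedback)       # 1. propose
        if not candidates:
            break
        top = max(candidates, key=lambda r: score(r, examples))  # 2. select
        top_score = score(top, examples)
        if top_score > best_score:
            best, best_score = top, top_score
        if best_score == 1.0:                          # all examples covered
            break
        # 3. refine: the next round sees the examples that still fail
        feedback = [(x, y) for x, y in examples if best(x) != y]
    return best, best_score


def toy_propose(examples: List[Example],
                feedback: List[Example]) -> List[Rule]:
    """Hypothetical stand-in for an LM: fixed guesses, revised on feedback."""
    if not feedback:
        return [lambda x: x + 1, lambda x: x * 2]
    return [lambda x: 2 * x + 1, lambda x: x * x]


EXAMPLES: List[Example] = [(0, 1), (1, 3), (2, 5)]
rule, accuracy = refine(toy_propose, EXAMPLES)
print(accuracy, rule(10))
```

In this toy run the first round's best guess covers only one of the three examples, so the two failing examples are passed back and the second round yields a rule that covers all of them. Note that the rule is applied by running code, not by asking the proposer to apply it, which is the separation between rule induction and rule application that the abstract highlights.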
