Hypothesis Search: Inductive Reasoning with Language Models

Inductive reasoning is a core problem-solving capacity: humans can identify underlying principles from a few examples, which can then be robustly generalized to novel scenarios. Recent work has evaluated large language models (LLMs) on inductive reasoning tasks by directly prompting them yielding "in context learning." This can work well for straightforward inductive tasks, but performs very poorly on more complex tasks such as the Abstraction and Reasoning Corpus (ARC). In this work, we propose to improve the inductive reasoning ability of LLMs by generating explicit hypotheses at multiple levels of abstraction: we prompt the LLM to propose multiple abstract hypotheses about the problem, in natural language, then implement the natural language hypotheses as concrete Python programs. These programs can be directly verified by running on the observed examples and generalized to novel inputs. Because of the prohibitive cost of generation with state-of-the-art LLMs, we consider a middle step to filter the set of hypotheses that will be implemented into programs: we either ask the LLM to summarize into a smaller set of hypotheses, or ask human annotators to select a subset of the hypotheses. We verify our pipeline's effectiveness on the ARC visual inductive reasoning benchmark, its variant 1D-ARC, and string transformation dataset SyGuS. On a random 40-problem subset of ARC, our automated pipeline using LLM summaries achieves 27.5% accuracy, significantly outperforming the direct prompting baseline (accuracy of 12.5%). With the minimal human input of selecting from LLM-generated candidates, the performance is boosted to 37.5%. (And we argue this is a lower bound on the performance of our approach without filtering.) Our ablation studies show that abstract hypothesis generation and concrete program representations are both beneficial for LLMs to perform inductive reasoning tasks.

翻译：归纳推理是核心问题解决能力：人类能从少量示例中识别潜在原理，并鲁棒地泛化到新场景。近期研究通过直接提示大语言模型（LLMs）执行"上下文学习"来评估其归纳推理能力。这种方法对简单的归纳任务效果良好，但在更复杂的任务（如抽象推理语料库ARC）中表现极差。本研究提出通过生成多层次抽象显式假设来提升LLM的归纳推理能力：我们引导LLM以自然语言形式提出多个关于问题的抽象假设，随后将这些自然语言假设实现为具体Python程序。这些程序可直接在观测样本上运行验证，并泛化到新输入。考虑到最先进LLM的生成成本过高，我们引入中间步骤对需实现的假设集进行过滤：既可要求LLM将假设归纳为更小子集，也可由人工标注员选择部分假设。我们在ARC视觉归纳推理基准、其变体1D-ARC及字符串变换数据集SyGuS上验证了该流程的有效性。在ARC随机选取的40题子集中，使用LLM摘要的自动化流程达到27.5%的准确率，显著优于直接提示基线（准确率12.5%）。通过最小化人工输入（从LLM生成候选集选择），性能提升至37.5%。（我们认为这是本方法未经过滤时的性能下限）。消融研究表明，抽象假设生成与具体程序表示均有利于LLM执行归纳推理任务。