Large language models (LLMs) have a substantial capacity for high-level analogical reasoning: reproducing patterns in linear text that occur in their training data (zero-shot evaluation) or in the provided context (few-shot in-context learning). However, recent studies show that even the more advanced LLMs fail in scenarios that require reasoning over multiple objects or facts and making sequences of logical deductions. We propose a two-stage probabilistic inference paradigm, ThinkSum, which reasons over sets of objects or facts in a structured manner. In the first stage (Think: retrieval of associations), an LLM is queried in parallel over a set of phrases extracted from the prompt or from an auxiliary model call. In the second stage (Sum: probabilistic inference or reasoning), the results of these queries are aggregated to make the final prediction. We demonstrate the possibilities and advantages of ThinkSum on the BIG-bench suite of LLM evaluation tasks, achieving improvements over the state of the art using GPT-family models on thirteen difficult tasks, often with far smaller model variants. We also compare and contrast ThinkSum with other proposed modifications to direct prompting of LLMs, such as variants of chain-of-thought prompting. Our results suggest that because the probabilistic inference in ThinkSum is performed outside of calls to the LLM, ThinkSum is less sensitive to prompt design, yields more interpretable predictions, and can be flexibly combined with latent variable models to extract structured knowledge from LLMs. Overall, our proposed paradigm represents a promising approach for enhancing the reasoning capabilities of LLMs.
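The two stages described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `toy_log_prob` is a hypothetical lookup table standing in for an LLM call, and the aggregation rule (averaging log-probabilities across phrases, then normalizing) is only one plausible choice for the Sum stage, since the abstract does not specify the rule used.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical signature for a model call: log P(candidate | phrase).
LogProbFn = Callable[[str, str], float]


def think(phrases: Sequence[str], candidates: Sequence[str],
          log_prob: LogProbFn) -> List[List[float]]:
    """Think stage: one independent query per (phrase, candidate) pair.

    The queries do not depend on each other, so a real system could
    issue them in parallel; here they run sequentially for simplicity.
    """
    return [[log_prob(p, c) for c in candidates] for p in phrases]


def sum_stage(table: List[List[float]],
              candidates: Sequence[str]) -> Dict[str, float]:
    """Sum stage: aggregate the query results outside the model.

    Assumed rule: average the log-probabilities over phrases, then
    normalize with a numerically stable softmax.
    """
    n = len(table)
    scores = [sum(row[j] for row in table) / n for j in range(len(candidates))]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return {c: w / total for c, w in zip(candidates, weights)}


# Toy stand-in for an LLM: a fixed table of made-up probabilities.
_TOY_PROBS = {
    ("a cat purrs", "animal"): 0.8, ("a cat purrs", "vehicle"): 0.2,
    ("a dog barks", "animal"): 0.9, ("a dog barks", "vehicle"): 0.1,
}


def toy_log_prob(phrase: str, candidate: str) -> float:
    return math.log(_TOY_PROBS[(phrase, candidate)])


phrases = ["a cat purrs", "a dog barks"]
candidates = ["animal", "vehicle"]
posterior = sum_stage(think(phrases, candidates, toy_log_prob), candidates)
print(posterior)
```

Because the aggregation happens in ordinary code rather than inside a prompt, the per-phrase scores in `table` can be inspected directly, which is the kind of interpretability the abstract attributes to performing inference outside the LLM calls.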